COMMON SHARED MEMORY IN A VERIFICATION SYSTEM

Related U.S. Application

This is a continuation of U.S. Patent Application Serial No. 09/918,600, filed July 30,
2001, entitled "Behavior Processor System and Method"; which is a continuation-in-part of U.S.
Patent Application Serial No. 09/900,124, filed July 6, 2001, entitled "Inter-Chip
Communication System"; which is a continuation-in-part of U.S. Patent Application Serial No.
09/373,014, filed August 11, 1999, entitled "VCD-on-Demand System and Method"; which is a
continuation-in-part of U.S. Patent Application Serial No. 09/144,222, filed August 31, 1998,
entitled "Timing-Insensitive and Glitch-Free Logic System and Method".

BACKGROUND OF THE INVENTION 

Field of the Invention 

The present invention generally relates to electronic design automation (EDA). More
particularly, the present invention relates to dynamically changing the evaluation period to
accelerate design debug sessions.

Description of Related Art 

In general, electronic design automation (EDA) is a computer-based tool configured in
various workstations to provide designers with automated or semi-automated tools for designing
and verifying a user's custom circuit designs. EDA is generally used for creating, analyzing, and
editing any electronic design for the purpose of simulation, emulation, prototyping, execution, or
computing. EDA technology can also be used to develop systems (i.e., target systems) which
will use the user-designed subsystem or component. The end result of EDA is a modified and
enhanced design, typically in the form of discrete integrated circuits or printed circuit boards,
that is an improvement over the original design while maintaining the spirit of the original design.

The value of software simulating a circuit design followed by hardware emulation is
recognized in various industries that use and benefit from EDA technology. Nevertheless,
current software simulation and hardware emulation/acceleration are cumbersome for the user
because of the separate and independent nature of these processes. For example, the user may
want to simulate or debug the circuit design using software simulation for part of the time, use
those results and accelerate the simulation process using hardware models during other times,
inspect various register and combinational logic values inside the circuit at select times, and
return to software simulation at a later time, all in one debug/test session. Furthermore, as
internal register and combinational logic values change as the simulation time advances, the user
should be able to monitor these changes even if the changes are occurring in the hardware model
during the hardware acceleration/emulation process.

Co-simulation arose out of a need to address some problems with the cumbersome nature
of using two separate and independent processes of pure software simulation and pure hardware
emulation/acceleration, and to make the overall system more user-friendly. However,
co-simulators still have a number of drawbacks: (1) co-simulation systems require manual
partitioning, (2) co-simulation uses two loosely coupled engines, (3) co-simulation speed is as
slow as software simulation speed, and (4) co-simulation systems encounter race conditions.

First, partitioning between software and hardware is done manually, instead of
automatically, further burdening the user. In essence, co-simulation requires the user to partition
the design (starting with behavior level, then RTL, and then gate level) and to test the models
themselves among the software and hardware at very large functional blocks. Such a constraint
requires some degree of sophistication by the user.

Second, co-simulation systems utilize two loosely coupled and independent engines, which
raise inter-engine synchronization, coordination, and flexibility issues. Co-simulation requires
synchronization of two different verification engines - software simulation and hardware
emulation. Even though the software simulator side is coupled to the hardware accelerator side,
only external pin-out data is available for inspection and loading. Values inside the modeled
circuit at the register and combinational logic level are not available for easy inspection and
downloading from one side to the other, limiting the utility of these co-simulator systems.
Typically, the user may have to re-simulate the whole design if the user switches from software
simulation to hardware acceleration and back. Thus, if the user wants to switch between
software simulation and hardware emulation/acceleration during a single debug session while
being able to inspect register and combinational logic values, co-simulator systems do not provide
this capability.

Third, co-simulation speed is as slow as simulation speed. Co-simulation requires
synchronization of two different verification engines - software simulation and hardware
emulation. Each of the engines has its own control mechanism for driving the simulation or
emulation. This implies that the synchronization between the software and hardware pushes the
overall performance to a speed that is as low as software simulation. The additional overhead to
coordinate the operation of these two engines adds to the slow speed of co-simulation systems.

Fourth, co-simulation systems encounter set-up, hold time, and clock glitch problems due
to race conditions in the hardware logic element or hardware accelerator among clock signals.
Co-simulators use hardware driven clocks, which may find themselves at the inputs to different
logic elements at different times due to different wire line lengths. This raises the uncertainty
level of evaluation results as some logic elements evaluate data at some time period and other
logic elements evaluate data at different time periods, when these logic elements should be
evaluating the data together.

Accordingly, a need exists in the industry for a system or method that addresses problems
raised above by currently known simulation systems, hardware emulation systems, hardware
accelerators, co-simulation, and coverification systems.

SUMMARY OF THE INVENTION

An object of the present invention is to use fewer hardware resources than the dedicated
hardware cross-bar technology while achieving similar performance levels.

Another object of the present invention is to be more resourceful than the virtual wires
technology without the decrease in performance arising from the use of extra evaluation cycles
for the transfer of inter-chip data.

One embodiment of the present invention is an inter-chip communication system that
transfers signals across FPGA chip boundaries only when these signals change values. This is
accomplished with a series of event detectors that detect changes in signal values and packet
schedulers that then schedule the transfer of these changed signal values to another
designated chip.
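
By way of illustration only, the event-driven transfer described above may be sketched in software as follows. This is a behavioral model and not the circuit of the invention; the class names `EventDetector` and `PacketScheduler`, the signal names, and the chip labels are hypothetical and chosen only for this example.

```python
from collections import deque


class EventDetector:
    """Reports which boundary signals changed since the previous evaluation."""

    def __init__(self, initial):
        self._last = dict(initial)

    def detect(self, current):
        # Keep only the signals whose value differs from the last seen value.
        changed = {n: v for n, v in current.items() if self._last.get(n) != v}
        self._last.update(changed)
        return changed


class PacketScheduler:
    """Queues one (destination chip, signal, value) packet per changed signal."""

    def __init__(self, routes):
        self._routes = routes  # hypothetical map: signal name -> destination chip
        self.queue = deque()

    def schedule(self, changed):
        for name, value in changed.items():
            self.queue.append((self._routes[name], name, value))


detector = EventDetector({"a": 0, "b": 1})
scheduler = PacketScheduler({"a": "chip2", "b": "chip3"})
scheduler.schedule(detector.detect({"a": 0, "b": 0}))  # only "b" changed
scheduler.schedule(detector.detect({"a": 0, "b": 0}))  # nothing changed, no packet
```

In this sketch the second evaluation produces no packet, which is the point of the scheme: unchanged signals consume no inter-chip transfer.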

These and other embodiments are fully discussed and illustrated in the following sections 
of the specification. 



BRIEF DESCRIPTION OF THE FIGURES

The above objects and description of the present invention may be better understood with 
the aid of the following text and accompanying drawings. 

FIG. 1 shows a high level overview of one embodiment of the present invention, including
the workstation, reconfigurable hardware emulation model, emulation interface, and the target
system coupled to a PCI bus.

FIG. 2 shows one particular usage flow diagram of the present invention.

FIG. 3 shows a high level diagram of the software compilation and hardware
configuration during compile time and run time in accordance with one embodiment of the
present invention.

FIG. 4 shows a flow diagram of the compilation process, which includes generating the
software/hardware models and the software kernel code.

FIG. 5 shows the software kernel that controls the overall SEmulation system.

FIG. 6 shows a method of mapping hardware models to reconfigurable boards through
mapping, placement, and routing.

FIG. 7 shows the connectivity matrix for the FPGA array shown in FIG. 8.

FIG. 8 shows one embodiment of the 4x4 FPGA array and its interconnections.

FIGS. 9(A), 9(B), and 9(C) illustrate one embodiment of the time division multiplexed
(TDM) circuit which allows a group of wires to be coupled together in a time multiplexed fashion
so that one pin, instead of a plurality of pins, can be used for this group of wires in a chip. FIG.
9(A) presents an overview of the pin-out problem, FIG. 9(B) provides a TDM circuit for the
transmission side, and FIG. 9(C) provides a TDM circuit for the receiver side.

FIG. 10 shows a SEmulation system architecture in accordance with one embodiment of 
the present invention. 

FIG. 11 shows one embodiment of the address pointer of the present invention.

FIG. 12 shows a state transition diagram of the address pointer initialization for the
address pointer of FIG. 11.

FIG. 13 shows one embodiment of the MOVE signal generator for derivatively generating
the various MOVE signals for the address pointer.

FIG. 14 shows the chain of multiplexed address pointers in each FPGA chip.

FIG. 15 shows one embodiment of the multiplexed cross chip address pointer chain in
accordance with one embodiment of the present invention.

FIG. 16 shows a flow diagram of the clock/data network analysis that is critical for the
software clock implementation and the evaluation of logic components in the hardware model.

FIG. 17 shows a basic building block of the hardware model in accordance with one
embodiment of the present invention.

FIGS. 18(A) and 18(B) show the register model implementation for latches and flip-flops.

FIG. 19 shows one embodiment of the clock edge detection logic in accordance with one
embodiment of the present invention.

FIG. 20 shows a four state finite state machine to control the clock edge detection logic of
FIG. 19 in accordance with one embodiment of the present invention.

FIG. 21 shows the interconnection, JTAG, FPGA bus, and global signal pin designations
for each FPGA chip in accordance with one embodiment of the present invention.

FIG. 22 shows one embodiment of the FPGA controller between the PCI bus and the
FPGA array.

FIG. 23 shows a more detailed illustration of the CTRL_FPGA unit and data buffer which 
were discussed with respect to FIG. 22. 

FIG. 24 shows the 4x4 FPGA array, its relationship to the FPGA banks, and expansion 
capability. 

FIG. 25 shows one embodiment of the hardware start-up method.

FIG. 26 shows the HDL code for one example of a user circuit design to be modeled and 
simulated. 

FIG. 27 shows a circuit diagram that symbolically represents the circuit design of the HDL
code in FIG. 26.

FIG. 28 shows the component type analysis for the HDL code of FIG. 26.

FIG. 29 shows a signal network analysis of a structured RTL HDL code based on the 
user's custom circuit design shown in FIG. 26. 

FIG. 30 shows the software/hardware partition result for the same hypothetical example. 

FIG. 31 shows a hardware model for the same hypothetical example. 

FIG. 32 shows one particular hardware model-to-chip partition result for the same 
hypothetical example of a user's custom circuit design. 

FIG. 33 shows another particular hardware model-to-chip partition result for the same
hypothetical example of a user's custom circuit design.

FIG. 34 shows the logic patching operation for the same hypothetical example of a user's 
custom circuit design. 

FIGS. 35(A) to 35(D) illustrate the principle of "hops" and interconnections with two 
examples. 

FIG. 36 shows an overview of the FPGA chip used in the present invention. 

FIG. 37 shows the FPGA interconnection buses on the FPGA chip. 

FIGS. 38(A) and 38(B) show side views of the FPGA board connection scheme in 
accordance with one embodiment of the present invention. 

FIG. 39 shows a direct-neighbor and one-hop six-board interconnection layout of the 
FPGA array in accordance with one embodiment of the present invention. 

FIGS. 40(A) and 40(B) show the FPGA inter-board interconnection scheme.

FIGS. 41(A) to 41(F) show top views of the board interconnection connectors. 

FIG. 42 shows on-board connectors and some components in a representative FPGA
board.

FIG. 43 shows a legend of the connectors in FIGS. 41(A) to 41(F) and 42. 

FIG. 44 shows a direct-neighbor and one-hop dual-board interconnection layout of the 
FPGA array in accordance with another embodiment of the present invention. 

FIG. 45 shows a workstation with multiprocessors in accordance with another 
embodiment of the present invention. 

FIG. 46 shows an environment in accordance with another embodiment of the present
invention in which multiple users share a single simulation/emulation system on a time-shared
basis.

FIG. 47 shows a high level structure of the Simulation server in accordance with one
embodiment of the present invention.

FIG. 48 shows the architecture of the Simulation server in accordance with one
embodiment of the present invention.

FIG. 49 shows a flow diagram of the Simulation server.

FIG. 50 shows a flow diagram of the job swapping process. 

FIG. 51 shows the signals between the device driver and the reconfigurable hardware unit.

FIG. 52 illustrates the time-sharing feature of the Simulation server for handling multiple
jobs with different levels of priorities.

FIG. 53 shows the communication handshake signals between the device driver and the
reconfigurable hardware unit.

FIG. 54 shows the state diagram of the communication handshake protocol.

FIG. 55 shows an overview of the client-server model of the Simulation server in
accordance with one embodiment of the present invention.

FIG. 56 shows a high level block diagram of the Simulation system for implementing
memory mapping in accordance with one embodiment of the present invention.

FIG. 57 shows a more detailed block diagram of the memory mapping aspect of the
Simulation system with supporting components for the memory finite state machine (MEMFSM)
and the evaluation finite state machine for each FPGA logic device (EVALFSMx).

FIG. 58 shows a state diagram of a finite state machine of the MEMFSM unit in the 
CTRL_FPGA unit in accordance with one embodiment of the present invention. 

FIG. 59 shows a state diagram of a finite state machine in each FPGA chip in accordance
with one embodiment of the present invention.

FIG. 60 shows the memory read data double buffer. 

FIG. 61 shows the Simulation write/read cycle in accordance with one embodiment of the 
present invention. 

FIG. 62 shows a timing diagram of the Simulation data transfer operation when the DMA
read operation occurs after the CLK_EN signal.

FIG. 63 shows a timing diagram of the Simulation data transfer operation when the DMA 
read operation occurs near the end of the EVAL period. 

FIG. 64 shows a typical user design implemented as a PCI add-on card.

FIG. 65 shows a typical hardware/software coverification system using an ASIC as the
device-under-test.

FIG. 66 shows a typical coverification system using an emulator where the
device-under-test is programmed in the emulator.

FIG. 67 shows a simulation system in accordance with one embodiment of the present
invention.

FIG. 68 shows a coverification system without external I/O devices in accordance with 
one embodiment of the present invention, where the RCC computing system contains a software 
model of the various I/O devices and the target system. 

FIG. 69 shows a coverification system with actual external I/O devices and the target
system in accordance with another embodiment of the present invention.

FIG. 70 shows a more detailed logic diagram of the data-in portion of the control logic in 
accordance with one embodiment of the present invention. 

FIG. 71 shows a more detailed logic diagram of the data-out portion of the control logic in
accordance with one embodiment of the present invention.

FIG. 72 shows the timing diagram of the data-in portion of the control logic.

FIG. 73 shows the timing diagram of the data-out portion of the control logic.

FIG. 74 shows a board layout of the RCC hardware array in accordance with one
embodiment of the present invention.

FIG. 75(A) shows an exemplary shift register circuit which will be used to explain the
hold time and clock glitch problems.

FIG. 75(B) shows a timing diagram of the shift register circuit shown in FIG. 75(A) to
illustrate hold time. 

FIG. 76(A) shows the same shift register circuit of FIG. 75(A) placed across multiple 
FPGA chips. 
FIG. 76(B) shows a timing diagram of the shift register circuit shown in FIG. 76(A) to
illustrate hold time violation.

FIG. 77(A) shows an exemplary logic circuit which will be used to illustrate a clock glitch 
problem. 

FIG. 77(B) shows a timing diagram of the logic circuit of FIG. 77(A) to illustrate the
clock glitch problem.

FIG. 78 shows a prior art timing adjustment technique for solving the hold time violation 
problem. 

FIG. 79 shows a prior art timing resynthesis technique for solving the hold time violation
problem.

FIG. 80(A) shows the original latch and FIG. 80(B) shows a timing insensitive and
glitch-free latch in accordance with one embodiment of the present invention.

FIG. 81(A) shows the original design flip-flop and FIG. 81(B) shows a timing insensitive
and glitch-free design type flip-flop in accordance with one embodiment of the present invention.

FIG. 82 shows a timing diagram of the trigger mechanism of the timing insensitive and
glitch-free latch and flip-flop in accordance with one embodiment of the present invention.

These figures will be discussed below with respect to several different aspects and
embodiments of the present invention.

FIG. 83 shows a high level view of the components of the RCC system which
incorporates one embodiment of the present invention.

FIG. 84 shows several simulation time periods to illustrate the VCD on-demand operation 
in accordance with one embodiment of the present invention. 

FIG. 85 shows a single row interconnect layout in accordance with one embodiment of the 
present invention. 

FIG. 86 shows a two-row interconnect layout in accordance with another embodiment of
the present invention.

FIG. 87 shows a three-row interconnect layout in accordance with another embodiment of
the present invention.

FIG. 88 shows a four-row interconnect layout in accordance with another embodiment of
the present invention.

FIG. 89 shows a table that summarizes the interconnect layout scheme for a three-row 
board in accordance with one embodiment of the present invention. 

FIG. 90 shows a system diagram of the dynamic logic evaluation system and method in
accordance with one embodiment of the present invention.

FIG. 91 shows a detailed circuit diagram of the propagation detector in accordance with
one embodiment of the present invention.

FIG. 92 shows the emulation system with the clock generator and the hardware test bench
board in accordance with one embodiment of the present invention.

FIG. 93 shows three exemplary asynchronous clocks to illustrate the emulation system in
accordance with one embodiment of the present invention.

FIG. 94 shows the clock generation scheduler for the emulation system in accordance with
one embodiment of the present invention.

FIG. 95 shows the clock generation slice unit for the emulation system in accordance with
one embodiment of the present invention.

FIG. 96 shows the details of the clock generation slice units in the clock generation
scheduler for the emulation system in accordance with one embodiment of the present invention.

FIG. 97 shows the event detector and packet scheduler in accordance with one
embodiment of the present invention for inter-chip communication.

FIGS. 98A and 98B show the circuit incorporating the event detector and the packet
scheduler at the chip boundaries in accordance with one embodiment of the present invention.

FIG. 99 shows a high level conventional debug environment.

FIG. 100 shows a high level co-modeling environment in accordance with one
embodiment of the present invention.

FIG. 101 shows the Behavior Processor and its interfaces in accordance with one
embodiment of the present invention.

FIG. 102 shows the Behavior Processor integrated with the RCC hardware system in 
accordance with one embodiment of the present invention. 

FIG. 103 shows a timing diagram of the relevant interfaces of the Behavior Processor in
accordance with one embodiment of the present invention.

FIG. 104 shows another timing diagram of the relevant interfaces of the Behavior 
Processor in accordance with one embodiment of the present invention. 

FIG. 105 shows the Behavior Processor modeled as an Xtrigger processor in accordance 
with one embodiment of the present invention. 

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS 

This specification will describe the various embodiments of the present invention through
and within the context of a system called "SEmulator" or "SEmulation" system. Throughout the
specification, the terms "SEmulation system," "SEmulator system," "SEmulator," or simply
"system" may be used. These terms refer to various apparatus and method embodiments in
accordance with the present invention for any combination of four operating modes: (1) software
simulation, (2) simulation through hardware acceleration, (3) in-circuit emulation (ICE), and (4)
post-simulation analysis, including their respective set-up or pre-processing stages. At other
times, the term "SEmulation" may be used. This term refers to the novel processes described
herein.

Similarly, terms such as "Reconfigurable Computing (RCC) Array System" or "RCC
computing system" refer to that portion of the simulation/coverification system that contains the
main processor, software kernel and the software model of the user design. Terms such as
"Reconfigurable hardware array" or "RCC hardware array" refer to that portion of the
simulation/coverification system that contains the hardware model of the user design and which
contains the array of reconfigurable logic elements, in one embodiment.

The specification also makes references to a "user" and a user's "circuit design" or
"electronic design." The "user" is a person who uses the SEmulation system through its
interfaces and may be the designer of a circuit or a tester/debugger who played little or no part in
the design process. The "circuit design" or "electronic design" is a custom designed system or
component, whether software or hardware, which can be modeled by the SEmulation system for
test/debug purposes. In many cases, the "user" also designed the "circuit design" or "electronic
design."

The specification also uses the terms "wire," "wire line," "wire/bus line," and "bus."
These terms refer to various electrically conducting lines. Each line may be a single wire
between two points or several wires between points. These terms are interchangeable in that a
"wire" may comprise one or more conducting lines and a "bus" may also comprise one or more
conducting lines.

This specification is presented in outline form. First, the specification presents a general
overview of the SEmulator system, including an overview of the four operating modes and the
hardware implementation schemes. Second, the specification provides a detailed discussion of the
SEmulator system. In some cases, one figure may provide a variation of an embodiment shown
in a previous figure. In these cases, like reference numerals will be used for like
components/units/processes. The outline of the specification is as follows:

I.     OVERVIEW
       A.  SIMULATION/HARDWARE ACCELERATION MODES
       B.  EMULATION WITH TARGET SYSTEM MODE
       C.  POST-SIMULATION ANALYSIS MODE
       D.  HARDWARE IMPLEMENTATION SCHEMES
       E.  SIMULATION SERVER
       F.  MEMORY SIMULATION
       G.  COVERIFICATION SYSTEM
II.    SYSTEM DESCRIPTION
III.   SIMULATION/HARDWARE ACCELERATION MODES
IV.    EMULATION WITH TARGET SYSTEM MODE
V.     POST-SIMULATION ANALYSIS MODE
VI.    HARDWARE IMPLEMENTATION SCHEMES
       A.  OVERVIEW
       B.  ADDRESS POINTER
       C.  GATED DATA/CLOCK NETWORK ANALYSIS
       D.  FPGA ARRAY AND CONTROL
       E.  ALTERNATE EMBODIMENT USING DENSER FPGA CHIPS
       F.  TIGF LOGIC DEVICES
       G.  DYNAMIC LOGIC EVALUATION
       H.  EMULATION SYSTEM WITH MULTIPLE ASYNCHRONOUS CLOCKS
       I.  INTER-CHIP COMMUNICATION
       J.  BEHAVIOR PROCESSOR SYSTEM
VII.   SIMULATION SERVER
VIII.  MEMORY SIMULATION
IX.    COVERIFICATION SYSTEM
X.     EXAMPLES



I. OVERVIEW

The various embodiments of the present invention have four general modes of operation:
(1) software simulation, (2) simulation through hardware acceleration, (3) in-circuit emulation,
and (4) post-simulation analysis. The various embodiments include the system and method of
these modes with at least some of the following features:

(1) a software and hardware model having a single tightly coupled simulation engine, a
software kernel, which controls the software and hardware models cycle by cycle; (2) automatic
component type analysis during the compilation process for software and hardware model
generation and partitioning; (3) ability to switch (cycle by cycle) among software simulation
mode, simulation through hardware acceleration mode, in-circuit emulation mode, and
post-simulation analysis mode; (4) full hardware model visibility through software combinational
component regeneration; (5) double-buffered clock modeling with software clocks and gated
clock/data logic to avoid race conditions; and (6) ability to re-simulate or hardware accelerate the
user's circuit design from any selected point in a past simulation session. The end result is a
flexible and fast simulator/emulator system and method with full HDL functionality and emulator
execution performance.

A. SIMULATION/HARDWARE ACCELERATION MODES

The SEmulator system, through automatic component type analysis, can model the user's
custom circuit design in software and hardware. The entire user circuit design is modeled in
software, whereas evaluation components (i.e., register component, combinational component)
are modeled in hardware. Hardware modeling is facilitated by the component type analysis.

A software kernel, residing in the main memory of the general purpose processor system,
serves as the SEmulator system's main program that controls the overall operation and execution
of its various modes and features. So long as any test-bench processes are active, the kernel
evaluates active test-bench components, evaluates clock components, detects clock edges to
update registers and memories and to propagate combinational logic data, and advances the
simulation time. This software kernel provides for the tightly coupled nature of the simulator
engine with the hardware acceleration engine. For the software/hardware boundary, the
SEmulator system provides a number of I/O address spaces - REG (register), CLK (software
clock), S2H (software to hardware), and H2S (hardware to software).
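
For illustration only, the kernel loop described above may be sketched as follows. This is a simplified software model under stated assumptions, not the kernel of the SEmulator system: the function `run_kernel`, the dictionary-based `state`, and the single clock named `clk` are hypothetical, and combinational functions are assumed to be listed in dependency order.

```python
def run_kernel(testbench, registers, combinational, state):
    """Hypothetical kernel loop.

    testbench:     iterable of dicts of input values per time step (includes 'clk')
    registers:     dict mapping register output name -> data input name
    combinational: dict mapping output name -> function of the state
    state:         dict of current signal values, updated in place
    """
    prev_clk = state.get("clk", 0)
    time = 0
    for stimulus in testbench:                    # while test-bench processes are active
        state.update(stimulus)                    # evaluate test-bench and clock components
        if prev_clk == 0 and state["clk"] == 1:   # detect a rising clock edge
            # Sample every register input first, then commit all together.
            next_vals = {q: state[d] for q, d in registers.items()}
            state.update(next_vals)
        for out, fn in combinational.items():     # propagate combinational logic data
            state[out] = fn(state)
        prev_clk = state["clk"]
        time += 1                                 # advance the simulation time
    return time


# A one-bit toggle: register q is fed by d, and d is the inverse of q.
state = {"clk": 0, "q": 0, "d": 1}
clock = [{"clk": v} for v in (1, 0, 1, 0, 1)]
elapsed = run_kernel(clock, {"q": "d"}, {"d": lambda s: s["q"] ^ 1}, state)
```

Three rising edges occur in the five steps, so the register toggles three times.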

The SEmulator has the capability to selectively switch among the four modes of operation.
The user of the system can start simulation, stop simulation, assert input values, inspect values,
single step cycle by cycle, and switch back and forth among the four different modes. For
example, the system can simulate the circuit in software for a time period, accelerate the
simulation through the hardware model, and return to software simulation mode.

Generally, the SEmulation system provides the user with the capability to "see" every
modeled component, regardless of whether it is modeled in software or hardware. For a variety
of reasons, combinational components are not as "visible" as registers, and thus, obtaining
combinational component data is difficult. One reason is that FPGAs, which are used in the
reconfigurable board to model the hardware portion of the user's circuit design, typically model
combinational components as look-up tables (LUT), instead of actual combinational components.
Accordingly, the SEmulation system reads register values and then regenerates combinational
components. Because some overhead is needed to regenerate the combinational components, this
regeneration process is not performed all the time; rather, it is done only upon the user's request.
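
As a hedged illustration of on-demand regeneration, the sketch below recomputes combinational values in software from a snapshot of register values. The function `regenerate_combinational`, the netlist format, and the signal names are hypothetical and serve only this example.

```python
def regenerate_combinational(register_values, netlist):
    """Recompute combinational node values from register values.

    netlist: list of (output_name, function, input_names), assumed to be
             in topological order so every input is known when it is used.
    """
    values = dict(register_values)
    for out, fn, ins in netlist:
        values[out] = fn(*(values[i] for i in ins))
    return values


# Register values as read back from the hardware model (illustrative data).
regs = {"r1": 1, "r2": 0, "r3": 1}
netlist = [
    ("n1", lambda a, b: a & b, ("r1", "r2")),  # AND gate
    ("n2", lambda a, b: a | b, ("n1", "r3")),  # OR gate fed by n1
]
visible = regenerate_combinational(regs, netlist)  # run only when the user asks
```

Calling the function only on request mirrors the text: the regeneration overhead is paid when the user inspects values, not on every cycle.
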
Because the software kernel resides in the software side, a clock edge detection
mechanism is provided to trigger the generation of a so-called software clock that drives the
enable input to the various registers in the hardware model. The timing is strictly controlled
through a double-buffered circuit implementation so that the software clock enable signal enters
the register model before the data to these models. Once the data inputs to these register models
have stabilized, the software clock gates the data synchronously to ensure that all data values are
gated together without any risk of hold-time violations.
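
The effect of double buffering can be illustrated with a behavioral sketch. The code below is an assumption-laden software analogy, not the register model circuit of the invention: the class `DoubleBufferedRegister` and the function `software_clock_edge` are hypothetical names.

```python
class DoubleBufferedRegister:
    """Two-stage register: a capture buffer and a committed output."""

    def __init__(self, value=0):
        self.buffer = value  # first stage: holds the stabilized input data
        self.q = value       # second stage: output seen by downstream logic

    def capture(self, d):
        self.buffer = d

    def commit(self):
        self.q = self.buffer


def software_clock_edge(registers, inputs):
    # Phase 1: every register captures its stable input while outputs hold still.
    for reg, d in zip(registers, inputs):
        reg.capture(d)
    # Phase 2: the software clock enable commits all registers together.
    for reg in registers:
        reg.commit()


# Three-stage shift register: each stage's input is the previous stage's output.
stages = [DoubleBufferedRegister(v) for v in (1, 0, 0)]
software_clock_edge(stages, [0, stages[0].q, stages[1].q])
shifted = [s.q for s in stages]
```

Because no output changes until every input has been captured, a stage never sees its neighbor's new value in the same cycle, which is the software analogue of avoiding a hold-time violation.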

Software simulation is also fast because the system logs all input values and only selected
register values/states, so overhead is minimized by decreasing the number of I/O operations.
The user can select the logging frequency.


B. EMULATION WITH TARGET SYSTEM MODE

The SEmulation system is capable of emulating the user's circuit within its target system
environment. The target system outputs data to the hardware model for evaluation and the
hardware model also outputs data to the target system. Additionally, the software kernel controls
the operation of this mode so that the user still has the option to start, stop, assert values, inspect
values, single step, and switch from one mode to another.

C. POST-SIMULATION ANALYSIS MODE

Logs provide the user with a historical record of the simulation session. Unlike known
simulation systems, the SEmulation system does not log every single value, internal state, or
value change during the simulation process. The SEmulation system logs only selected values
and states based on a logging frequency (e.g., log 1 record every N cycles). During the
post-simulation stage, if the user wants to examine various data around point X in the just-completed
simulation session, the user goes to the logged point, say logged point Y, that is closest to
and temporally located prior to point X. The user then simulates from that selected logged point
Y to the desired point X to obtain simulation results.
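
By way of example only, the replay from a logged point may be sketched as follows. The functions `closest_prior_log` and `resimulate`, the accumulator used as the "design," and the logging interval of four cycles are all hypothetical and chosen for illustration.

```python
import bisect


def closest_prior_log(logged_cycles, x):
    """Return the latest logged cycle Y with Y <= X (logged_cycles sorted)."""
    i = bisect.bisect_right(logged_cycles, x)
    if i == 0:
        raise ValueError("no logged point at or before the requested cycle")
    return logged_cycles[i - 1]


def resimulate(logs, inputs, step, x):
    """Replay from the closest prior logged point Y up to cycle X.

    logs:   dict mapping logged cycle -> saved state at the start of that cycle
    inputs: logged input value for every cycle
    step:   function (state, input) -> next state
    """
    y = closest_prior_log(sorted(logs), x)
    state = logs[y]
    for cycle in range(y, x):
        state = step(state, inputs[cycle])
    return y, state


# Illustrative design: an accumulator; state is logged every N = 4 cycles.
inputs = list(range(12))
logs = {0: 0, 4: 6, 8: 28}  # running sums of inputs at cycles 0, 4, and 8
result = resimulate(logs, inputs, lambda s, i: s + i, 6)
```

Only two cycles are replayed to reach cycle 6, rather than six from the start of the session, which is the trade the logging frequency controls.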

Also, a VCD on-demand system will be described. This VCD on-demand system allows
the user to view any simulation target range (i.e., simulation times) on demand without
simulation rerun.

D. HARDWARE IMPLEMENTATION SCHEMES

The SEmulation system implements an array of FPGA chips on a reconfigurable board.
Based on the hardware model, the SEmulation system partitions, maps, places, and routes each
selected portion of the user's circuit design onto the FPGA chips. Thus, for example, a 4x4
array of 16 chips may model a large circuit spread out across these 16 chips. The
interconnect scheme allows each chip to access another chip within 2 "jumps" or links.

Each FPGA chip implements an address pointer for each of the I/O address spaces (i.e.,
REG, CLK, S2H, H2S). All address pointers associated with a particular address space are
chained together. So, during data transfer, word data in each chip is
sequentially selected from/to the main FPGA bus and PCI bus, one word at a time for the
selected address space in each chip, and one chip at a time, until the desired word data have been
accessed for that selected address space. This sequential selection of word data is accomplished
by a propagating word selection signal. This word selection signal travels through the address
pointer in a chip and then propagates to the address pointer in the next chip, continuing until it
reaches the last chip or the system initializes the address pointer.
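
The ordering imposed by the chained address pointers can be sketched in Python as follows. This is only a software illustration of the selection order, not of the hardware signal itself; the data layout (a list of per-chip dictionaries keyed by address space name) and the function name are assumptions made for the example.

```python
def chained_transfer(chips, space):
    """Yield (chip_index, word_index, word) in the order a propagating
    word selection signal would visit them: one word at a time within a
    chip, then on to the next chip in the chain."""
    for chip_index, chip in enumerate(chips):
        # A chip with no words in this address space simply passes the signal on.
        for word_index, word in enumerate(chip.get(space, [])):
            yield chip_index, word_index, word
```

For example, with three chips holding two, zero, and one REG words respectively, the words are visited as chip 0 word 0, chip 0 word 1, then chip 2 word 0.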

The FPGA bus system in the reconfigurable board operates at twice the PCI bus
bandwidth but at half the PCI bus speed. The FPGA chips are thus separated into banks to utilize
the larger bandwidth bus. The throughput of this FPGA bus system can track the throughput of
the PCI bus system so performance is not lost by reducing the bus speed. Expansion is possible
through piggyback boards that extend the bank length.

In another embodiment of the present invention, denser FPGA chips are used. Examples of
such denser chips are the Altera 10K130V and 10K250V chips. Use of these chips alters the board
design such that only four FPGA chips, instead of eight less dense FPGA chips (e.g., Altera
10K100), are used per board.

The FPGA array in the Simulation system is provided on the motherboard through a
particular board interconnect structure. Each chip may have up to eight sets of interconnections,
where the interconnections are arranged according to adjacent direct-neighbor interconnects (i.e.,
N[73:0], S[73:0], W[73:0], E[73:0]), and one-hop neighbor interconnects (i.e., NH[27:0],
SH[27:0], XH[36:0], XH[72:37]), excluding the local bus connections, within a single board and
across different boards. Each chip is capable of being interconnected directly to adjacent
neighbor chips, or in one hop to a non-adjacent chip located above, below, left, and right. In the
X direction (east-west), the array is a torus. In the Y direction (north-south), the array is a mesh.

The interconnects alone can couple logic devices and other components within a single
board. However, inter-board connectors are provided to couple these boards and interconnects
together across different boards to carry signals between (1) the PCI bus via the motherboard and
the array boards, and (2) any two array boards.
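
The torus/mesh arrangement of direct-neighbor interconnects described above can be sketched in Python. The 4x4 default size, the (row, column) coordinate convention with row 0 at the north edge, and the function name are assumptions made for illustration; one-hop interconnects are not modeled.

```python
def direct_neighbors(row, col, rows=4, cols=4):
    """Direct neighbors of the chip at (row, col).

    East-west links wrap around (torus in X); north-south links do not
    (mesh in Y), so chips on the north and south edges lack one neighbor.
    """
    nbrs = {
        "W": (row, (col - 1) % cols),  # wraps from column 0 to the last column
        "E": (row, (col + 1) % cols),  # wraps from the last column to column 0
    }
    if row > 0:
        nbrs["N"] = (row - 1, col)
    if row < rows - 1:
        nbrs["S"] = (row + 1, col)
    return nbrs
```

Under these assumptions, the chip at the north-west corner has a west neighbor at the far end of its own row but no north neighbor.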

A motherboard connector connects the board to the motherboard, and hence, to the PCI
bus, power, and ground. For some boards, the motherboard connector is not used for direct
connection to the motherboard. In a six-board configuration, only boards 1, 3, and 5 are directly
connected to the motherboard while the remaining boards 2, 4, and 6 rely on their neighbor
boards for motherboard connectivity. Thus, every other board is directly connected to the
motherboard, and interconnects and local buses of these boards are coupled together via
inter-board connectors arranged solder-side to component-side. PCI signals are routed through one
of the boards (typically the first board) only. Power and ground are applied to the other
motherboard connectors for those boards. Placed solder-side to component-side, the various
inter-board connectors allow communication among the PCI bus components, the FPGA logic
devices, memory devices, and various Simulation system control circuits.

E. SIMULATION SERVER 

In another embodiment of the present invention, a Simulation server is provided to allow
multiple users to access the same reconfigurable hardware unit. In one system configuration,
multiple workstations across a network or multiple users/processes in a non-network environment
can access the same server-based reconfigurable hardware unit to review/debug the same or
different user circuit designs. The access is accomplished via a time-shared process in which a
scheduler determines access priorities for the multiple users, swaps jobs, and selectively locks
hardware model access among the scheduled users. In one scenario, each user can access the
server to map his/her separate user design to the reconfigurable hardware model for the first
time, in which case the system compiles the design to generate the software and hardware
models, performs the clustering operation, performs place-and-route operations, generates a
bitstream configuration file, and reconfigures the FPGA chips in the reconfigurable hardware unit
to model the hardware portion of the user's design. When one user has accelerated his design
using the hardware model and downloaded the hardware state to his own memory for software
simulation, the hardware unit can be released for access by another user.

The server allows multiple users or processes to access the reconfigurable hardware
unit for acceleration and hardware state swapping purposes. The Simulation server includes the
scheduler, one or more device drivers, and the reconfigurable hardware unit. The scheduler in
the Simulation server is based on a preemptive round robin algorithm. The server scheduler
includes a simulation job queue table, a priority sorter, and a job swapper. The restore and
playback function of the present invention facilitates the non-network multiprocessing
environment as well as the network multi-user environment in which previous checkpoint state
data can be downloaded and the entire simulation state associated with that checkpoint can be
restored for playback debugging or cycle-by-cycle stepping.
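
A preemptive round robin scheduler of the general kind named above might be sketched in Python as follows. The specification does not detail how the priority sorter and job swapper interact, so this sketch assumes that priority only determines the initial queue order and that a job is swapped out after a fixed time slice; the function name, the job tuple layout, and the time slice parameter are all hypothetical.

```python
from collections import deque

def round_robin(jobs, time_slice):
    """Return the sequence of (job_name, cycles_run) slices.

    jobs: list of (name, priority, cycles_needed); a higher priority is
    queued first (assumed role of the priority sorter).
    """
    # Priority sorter: order the job queue once; sorted() is stable, so
    # jobs of equal priority keep their submission order.
    queue = deque(sorted(jobs, key=lambda job: -job[1]))
    schedule = []
    while queue:
        name, priority, remaining = queue.popleft()
        run = min(time_slice, remaining)
        schedule.append((name, run))
        if remaining > run:
            # Job swapper: preempt the unfinished job and requeue it at the back.
            queue.append((name, priority, remaining - run))
    return schedule
```

In the actual system, swapping a job out would also involve saving its hardware state so that it can be restored when the job is next scheduled; the sketch omits this.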

F. MEMORY SIMULATION

The Memory Simulation or memory mapping aspect of the present invention provides an
effective way for the Simulation system to manage the various memory blocks associated with the
configured hardware model of the user's design, which was programmed into the array of FPGA
chips in the reconfigurable hardware unit. The memory Simulation aspect of the invention
provides a structure and scheme where the numerous memory blocks associated with the user's
design are mapped into the SRAM memory devices in the Simulation system instead of inside the
logic devices, which are used to configure and model the user's design. The memory Simulation
system includes a memory state machine, an evaluation state machine, and their associated logic
to control and interface with: (1) the main computing system and its associated memory system,
(2) the SRAM memory devices coupled to the FPGA buses in the Simulation system, and (3) the
FPGA logic devices which contain the configured and programmed user design that is being
debugged. The operation of the memory Simulation system in accordance with one embodiment
of the present invention is generally as follows. The Simulation write/read cycle is divided into
three periods: DMA data transfer, evaluation, and memory access.

The FPGA logic device side of the memory Simulation system includes an evaluation state
machine, an FPGA bus driver, and a logic interface for each memory block N to interface with
the user's own memory interface in the user design to handle: (1) data evaluations among the
FPGA logic devices, and (2) write/read memory access between the FPGA logic devices and the
SRAM memory devices. In conjunction with the FPGA logic device side, the FPGA I/O
controller side includes a memory state machine and interface logic to handle DMA, write, and
read operations between: (1) main computing system and SRAM memory devices, and (2) FPGA
logic devices and the SRAM memory devices.

G. COVERIFICATION SYSTEM

One embodiment of the present invention is a coverification system that includes a
reconfigurable computing system (hereinafter "RCC computing system") and a reconfigurable
computing hardware array (hereinafter "RCC hardware array"). In some embodiments, the
target system and the external I/O devices are not necessary since they can be modeled in
software. In other embodiments, the target system and the external I/O devices are actually
coupled to the coverification system to obtain speed and use actual data, rather than simulated test
bench data. Thus, a coverification system can incorporate the RCC computing system and RCC
hardware array along with other functionality to debug the software portion and hardware portion
of a user's design while using the actual target system and/or I/O devices.

The RCC computing system also contains clock logic (for clock edge detection and
software clock generation), test bench processes for testing the user design, and device models
for any I/O device that the user decides to model in software instead of using an actual physical
I/O device. Of course, the user may decide to use actual I/O devices as well as modeled I/O
devices in one debug session. The software clock is provided to the external interface to function
as the external clock source for the target system and the external I/O devices. The use of this
software clock provides the synchronization necessary to process incoming and outgoing data.
Because the RCC computing system-generated software clock is the time base for the debug
session, simulated and hardware-accelerated data are synchronized with any data that is delivered
between the coverification system and the external interface.

When the target system and the external I/O devices are coupled to the coverification
system, pin-out data must be provided between the coverification system and its external
interface. The coverification system contains control logic that provides traffic control between:
(1) the RCC computing system and the RCC hardware array, and (2) the external interface
(which is coupled to the target system and the external I/O devices) and the RCC hardware
array. Because the RCC computing system has the model of the entire design in software,
including that portion of the user design modeled in the RCC hardware array, the RCC
computing system must also have access to all data that passes between the external interface and
the RCC hardware array. The control logic ensures that the RCC computing system has access to
these data.

II. SYSTEM DESCRIPTION

FIG. 1 shows a high level overview of one embodiment of the present invention. A
workstation 10 is coupled to a reconfigurable hardware model 20 and emulation interface 30 via
PCI bus system 50. The reconfigurable hardware model 20 is coupled to the emulation interface
30 via PCI bus 50, as well as cable 61. A target system 40 is coupled to the emulation interface
30 via cables 60. In other embodiments, the in-circuit emulation set-up 70, which comprises the
emulation interface 30 and target system 40 (as shown in the dotted line box), is not provided
when emulation of the user's circuit design within the target system's environment is
not desired during a particular test/debug session. Without the in-circuit emulation set-up 70, the
reconfigurable hardware model 20 communicates with the workstation 10 via the PCI bus 50.

In combination with the in-circuit emulation set-up 70, the reconfigurable hardware model
20 imitates or mimics the user's circuit design of some electronic subsystem in the target system.
To ensure the correct operation of the user's circuit design of the electronic subsystem within the
target system's environment, input and output signals between the target system 40 and the
modeled electronic subsystem must be provided to the reconfigurable hardware model 20 for
evaluation. Hence, the input and output signals of the target system 40 to/from the
reconfigurable hardware model 20 are delivered via cables 60 through the emulation interface 30
and the PCI bus 50. Alternatively, input/output signals of the target system 40 can be delivered
to the reconfigurable hardware model 20 via emulation interface 30 and cable 61.

The control data and some substantive simulation data pass between the reconfigurable
hardware model 20 and the workstation 10 via the PCI bus 50. Indeed, the workstation 10 runs
the software kernel that controls the operation of the entire SEmulation system and must have
access (read/write) to the reconfigurable hardware model 20.

A workstation 10 complete with a computer, keyboard, mouse, monitor and appropriate
bus/network interface allows a user to enter and modify data describing the circuit design of an
electronic system. Exemplary workstations include a Sun Microsystems SPARC or
ULTRA-SPARC workstation or an Intel/Microsoft-based computing station. As known to those
ordinarily skilled in the art, the workstation 10 comprises a CPU 11, a local bus 12, a host/PCI
bridge 13, memory bus 14, and main memory 15. The various software simulation, simulation by
hardware acceleration, in-circuit emulation, and post-simulation analysis aspects of the present
invention are provided in the workstation 10, reconfigurable hardware model 20, and emulation
interface 30. The algorithm embodied in software is stored in main memory 15 during a test/debug
session and executed through the CPU 11 via the workstation's operating system.

As known to those ordinarily skilled in the art, after the operating system is loaded into
the memory of workstation 10 by the start-up firmware, control passes to its initialization code to
set up necessary data structures, and load and initialize device drivers. Control is then passed to
the command line interpreter (CLI), which prompts the user to indicate the program to be run.
The operating system then determines the amount of memory needed to run the program, locates
the block of memory, or allocates a block of memory and accesses the memory either directly or
through BIOS. After completion of the memory loading process, the application program begins
execution.

One embodiment of the present invention is a particular application program for
SEmulation. During the course of its execution, the application program may require numerous
services from the operating system, including, but not limited to, reading from and writing to
disk files, performing data communications, and interfacing with the display/keyboard/mouse.

The workstation 10 has the appropriate user interface to allow the user to enter the circuit
design data, edit the circuit design data, monitor the progress of simulations and emulations while
obtaining results, and essentially control the simulation and emulation process. Although not
shown in FIG. 1, the user interface includes user-accessible menu-driven options and command
sets which can be entered with the keyboard and mouse and viewed with a monitor. Typically,
the user uses a computing station 80 with a keyboard 90.

The user typically creates a particular circuit design of an electronic system and enters an
HDL (usually structured RTL level) code description of his designed system into the workstation
10. The SEmulation system of the present invention performs component type analysis, among
other operations, for partitioning the modeling between software and hardware. The SEmulation
system models behavior, RTL, and gate level code in software. For hardware modeling, the
system can model RTL and gate level code; however, the RTL level must be synthesized to gate
level prior to hardware modeling. The gate level code can be processed directly into usable
source design database format for hardware modeling. Using the RTL and gate level codes, the
system automatically performs component type analysis to complete the partition step. Based on
the partitioning analysis during software compile time, the system maps some portion of the
circuit design into hardware for fast simulation via hardware acceleration. The user can also
couple the modeled circuit design to the target system for real environment in-circuit emulation.
Because the software simulation and the hardware acceleration engines are tightly coupled
through the software kernel, the user can then simulate the overall circuit design using software
simulation, accelerate the test/debug process by using the hardware model of the mapped circuit
design, return to the simulation portion, and return to the hardware acceleration until the
test/debug process is complete. The ability to switch between software simulation and hardware
acceleration cycle-by-cycle and at will by the user is one of the valuable features of this
embodiment. This feature is particularly useful in the debug process by allowing the user to go
to a particular point or cycle very quickly using the hardware acceleration mode and then using
software simulation to examine various points thereafter to debug the circuit design. Moreover,
the SEmulation system makes all components visible to the user whether the internal realization
of the component is in hardware or software. The SEmulation system accomplishes this by
reading the register values from the hardware model and then rebuilding the combinational
components using the software model when the user requests such a read. These and other
features will be discussed more fully later in the specification.
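
The idea of rebuilding combinational values from register values read back from the hardware model can be sketched in Python. This is a simplified illustration only: the representation of the software model as a topologically ordered list of (signal, function, inputs) entries, and all names used, are assumptions made for the example and do not reflect the actual regeneration process described later in the specification.

```python
def regenerate(register_values, comb_logic):
    """Recompute combinational signal values from register values.

    register_values: dict mapping register name to the value read back
        from the hardware model.
    comb_logic: list of (signal_name, function, input_names) entries,
        assumed to be ordered so every input is computed before it is used.
    """
    values = dict(register_values)  # start from the hardware register state
    for name, fn, inputs in comb_logic:
        values[name] = fn(*(values[i] for i in inputs))
    return values
```

For example, given registers r1 = 1 and r2 = 0, an AND node n1 = r1 & r2 and an inverter n2 = NOT n1 would be regenerated as n1 = 0 and n2 = 1, making both visible to the user although only the registers were read from hardware.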

The workstation 10 is coupled to a bus system 50. The bus system can be any available
bus system that allows various agents, such as the workstation 10, reconfigurable hardware model
20, and emulation interface 30, to be operably coupled together. Preferably, the bus system is
fast enough to provide real-time or near real-time results to the user. One such bus system is the
bus system described in the Peripheral Component Interconnect (PCI) standard, which is
incorporated herein by reference. Currently, revision 2.0 of the PCI standard provides for a 33
MHz bus speed. Revision 2.1 provides support for 66 MHz bus speed. Accordingly, the
workstation 10, reconfigurable hardware model 20, and emulation interface 30 may comply with
the PCI standard.

In one embodiment, communication between the workstation 10 and the reconfigurable
hardware model 20 is handled on the PCI bus. Other PCI-compliant devices may be found in this
bus system. These devices may be coupled to the PCI bus at the same level as the workstation
10, reconfigurable hardware model 20, and emulation interface 30, or other levels. Each PCI bus
at a different level, such as PCI bus 52, is coupled to another PCI bus level, such as PCI bus 50,
if it exists at all, through a PCI-to-PCI bridge 51. At PCI bus 52, two PCI devices 53 and 54
may be coupled therewith.

The reconfigurable hardware model 20 comprises an array of field-programmable gate
array (FPGA) chips that can be programmably configured and reconfigured to model the
hardware portion of the user's electronic system design. In this embodiment, the hardware model
is reconfigurable; that is, it can reconfigure its hardware to suit the particular computation or user
circuit design at hand. If, for example, many adders or multiplexers are required, the system is
configured to include many adders and multiplexers. As other computing elements or functions are
needed, they may also be modeled or formed in the system. In this way, the system can be
optimized to perform specialized computations or logic operations. Reconfigurable systems are also
flexible, so that users can work around minor hardware defects that arise during manufacture,
testing, or use. In one embodiment, the reconfigurable hardware model 20 comprises a
two-dimensional array of computing elements consisting of FPGA chips to provide the computational
resources for various user circuit designs and applications. More details on the hardware
configuration process will be provided.

Two such FPGA chips include those sold by Altera and Xilinx. In some embodiments,
the reconfigurable hardware model is reconfigurable via the use of field programmable devices.
However, other embodiments of the present invention may be implemented using application
specific integrated circuit (ASIC) technology. Still other embodiments may be in the form of a
custom integrated circuit.

In a typical test/debug scenario, reconfigurable devices will be used to simulate/emulate
the user's circuit design so that appropriate changes can be made prior to actual prototype
manufacturing. In some other instances, however, an actual ASIC or custom integrated circuit
can be used, although this deprives the user of the ability to quickly and cost-effectively change a
possibly non-functional circuit design for re-simulation and re-emulation. At times, though, such
an ASIC or custom IC has already been manufactured and is readily available so that emulation
with an actual non-reconfigurable chip may be preferable.

In accordance with the present invention, the software in the workstation, along with its
integration with an external hardware model, provides a greater degree of flexibility, control, and
performance for the end user over existing systems. To run the simulation and emulation, a
model of the circuit design and the relevant parameters (e.g., input test-bench stimulus, overall
system output, intermediate results) are determined and provided to the simulation software
system. The user can use either schematic capture tools or synthesis tools to define the system
circuit design. The user starts with a circuit design of an electronic system, usually in draft
schematic form, which is then converted to HDL form using synthesis tools. The HDL can also
be directly written by the user. Exemplary HDL languages include Verilog and VHDL;
however, other languages are also available. A circuit design represented in HDL comprises
many concurrent components. Each component is a sequence of code which either defines the
behavior of a circuit element or controls the execution of the simulation.

The SEmulation system analyzes these components to determine their component types
and the compiler uses this component type information to build different execution models in
software and hardware. Thereafter, the user can use the SEmulation system of the present
invention. The designer can verify the accuracy of the circuit through simulation by applying
various stimuli such as input signals and test vector patterns to the simulated model. If, during
the simulation, the circuit does not behave as planned, the user re-defines the circuit by
modifying the circuit schematic or the HDL file.

The use of this embodiment of the present invention is shown in the flow chart of FIG. 2.
The algorithm starts at step 100. After loading the HDL file into the system, the system
compiles, partitions, and maps the circuit design to appropriate hardware models. The
compilation, partition, and mapping steps are discussed in more detail below.

Before the simulation runs, the system must run a reset sequence to remove all the
unknown "x" values in software before the hardware acceleration model can function. One
embodiment of the present invention uses a 2-bit wide data path to provide a 4-state value for the
bus signal - "00" is logic low, "01" is logic high, "10" is "z," and "11" is "x." As known to
those ordinarily skilled in the art, software models can deal with "0," "1," "x" (bus conflicts or
unknown value), and "z" (no driver or high impedance). In contrast, hardware cannot deal with
the unknown values "x," so the reset sequence, which varies depending on the particular
applicable code, resets the register values to all "0" or all "1."
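
The 2-bit encoding of the four signal states, and the effect of a reset sequence, can be illustrated with a short Python sketch. The encoding table follows the values given above; the function names and the representation of register state as a list of characters are assumptions made for the example, and the actual reset sequence depends on the user's code.

```python
# 2-bit encoding of the 4-state bus value, as given in the text.
ENCODE = {"0": 0b00, "1": 0b01, "z": 0b10, "x": 0b11}
DECODE = {bits: state for state, bits in ENCODE.items()}

def encode_bus(states):
    """Encode a sequence of 4-state values ('0', '1', 'z', 'x') as 2-bit codes."""
    return [ENCODE[s] for s in states]

def reset_registers(register_states, value="0"):
    """Model the outcome of a reset sequence: every register is driven to all '0' or all '1'."""
    if value not in ("0", "1"):
        raise ValueError("registers can only be reset to '0' or '1'")
    return [value for _ in register_states]

def hardware_ready(register_states):
    """Hardware cannot represent 'x', so it is usable only once no register is unknown."""
    return "x" not in register_states
```

For instance, a register set in the state x, 1, x, 0 is not usable by the hardware model until a reset drives every register to a known value.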

At step 105, the user decides whether to simulate the circuit design. Typically, a user will
start the system with software simulation first. Thus, if the decision at step 105 resolves to
"YES," software simulation occurs at step 110.

The user can stop the simulation to inspect values as shown in step 115. Indeed, the user
can stop the simulation at any time during the test/debug session as shown by the dotted lines
extending from step 115 to various nodes in the hardware acceleration mode, ICE mode, and
post-simulation mode. Executing step 115 takes the user to step 160.

After stopping, the system kernel reads back the state of hardware register components to
regenerate the entire software model, including the combinational components, if the user wants
to inspect combinational component values. After restoring the entire software model, the user
can inspect any signal value in the system. After stopping and inspection, the user can continue
to run in simulation only mode or hardware model acceleration mode. As shown in the flow
chart, step 115 branches to the stop/value inspect routine. The stop/value inspect routine starts at
step 160. At step 165, the user must decide whether to stop the simulation at this point and
inspect values. If step 165 resolves to "YES," step 170 stops the simulation that may be
currently underway and inspects various values to check for correctness of the circuit design. At
step 175, the algorithm returns to the point at which it branched, which is at step 115. Here, the
user can continue to simulate and stop/inspect values for the remainder of the test/debug session
or proceed forward to the in-circuit emulation step.

Similarly, if step 105 resolves to "NO," the algorithm will proceed to the hardware
acceleration decision step 120. At step 120, the user decides whether to accelerate the test/debug
process by accelerating the simulation through the hardware portion of the modeled circuit
design. If the decision at step 120 resolves to "YES," then hardware model acceleration occurs
at step 125. During the system compilation process, the SEmulation system mapped some
portions into a hardware model. Here, when hardware acceleration is desired, the system moves
register and combinational components into the hardware model and moves the input and
evaluation values to the hardware model. Thus, during hardware acceleration, the evaluation
occurs in the hardware model for a long time period at the accelerated speed. The kernel writes
test-bench output to the hardware model, updates the software clock, then reads the hardware
model output values cycle-by-cycle. If desired by the user, values from the entire software model
of the user's circuit design, which is the entire circuit design, can be made available by outputting
the register values and regenerating the combinational components from those
register values. Because of the need for software intervention to regenerate these combinational
components, outputs of values for the entire software model are not provided at every cycle;
rather, values are provided to the user only if the user wants such values. This specification will
discuss the combinational component regeneration process later.

Again, the user can stop the hardware acceleration mode at any time as indicated by step
115. If the user wants to stop, the algorithm proceeds to steps 115 and 160 to branch to the
stop/value inspect routine. Here, as in step 115, the user can stop the hardware accelerated
simulation process at any time and inspect values resulting from the simulation process, or the
user can continue with the hardware-accelerated simulation process. The stop/value inspect
routine branches to steps 160, 165, 170, and 175, which were discussed above in the context of
stopping the simulation. Returning to the main routine after step 125, the user can decide to
continue with the hardware-accelerated simulation or perform pure simulation instead at step 135.
If the user wants to simulate further, the algorithm proceeds to step 105. If not, the algorithm
proceeds to the post-simulation analysis at step 140.

At step 140, the SEmulation system provides a number of post-simulation analysis
features. The system logs all inputs to the hardware model. For hardware model outputs, the
system logs all values of hardware register components at a user-defined logging frequency (e.g.,
1/10,000 record/cycle). The logging frequency determines how often the output values are
recorded. For a logging frequency of 1/10,000 record/cycle, output values are recorded once
every 10,000 cycles. The higher the logging frequency, the more information is recorded for
later post-simulation analysis. Because the selected logging frequency directly affects
the SEmulation speed, the user selects the logging frequency with care. A higher logging
frequency will decrease the SEmulation speed because the system must spend time and resources
to record the output data by performing I/O operations to memory before further simulation can
be performed.
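
The trade-off governed by the logging frequency can be made concrete with a small Python calculation. The function names are hypothetical, and the sketch assumes that records are evenly spaced and that the cost of reaching an arbitrary cycle during post-simulation analysis is the distance from the nearest earlier record.

```python
def records_logged(total_cycles, cycles_per_record):
    """Number of register-state records written during a run of total_cycles."""
    return total_cycles // cycles_per_record

def worst_case_replay(cycles_per_record):
    """Largest number of cycles that must be re-simulated to reach any cycle
    from the nearest earlier record."""
    return cycles_per_record - 1
```

At 1/10,000 record/cycle, a run of 1,000,000 cycles writes 100 records and any cycle can be reached by re-simulating at most 9,999 cycles. Raising the frequency to 1/1,000 record/cycle writes ten times as many records (more I/O during the run) in exchange for a worst-case replay of 999 cycles.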

With respect to the post-simulation analysis, the user selects a particular point at which
simulation is desired. The user can then perform analysis after SEmulation by running the
software simulation with input logs to the hardware model to compute the value changes and
internal states of all hardware components. Note that the hardware accelerator is used to simulate
the data from the selected logging point to analyze simulation results. This post-simulation
analysis method can link to any simulation waveform viewer for post-simulation analysis. More
detailed discussion will follow.

At step 145, the user can opt to emulate the simulated circuit design within its target 
system environment. If step 145 resolves to "NO," the algorithm ends and the SEmulation
process ends at step 155. If emulation with the target system is desired, the algorithm proceeds
to step 150. This step involves activating the emulation interface board, plugging the cable and
chip pin adapter to the target system, and running the target system to obtain the system I/O from
the target system. The system I/O from the target system includes signals between the target
system and the emulation of the circuit design. The emulated circuit design receives input signals
from the target system, processes these, sends them to the SEmulation system for further
processing, and outputs the processed signals to the target system. Conversely, the emulated
circuit design sends output signals to the target system, which processes these, and possibly
outputs the processed signals back to the emulated circuit design. In this way, the performance of
the circuit design can be evaluated in its natural target system environment. After the emulation
with the target system, the user has results that validate the circuit design or reveal non-functional
aspects. At this point, the user can simulate/emulate again as indicated at step 135, stop
altogether to modify the circuit design, or proceed to integrated circuit fabrication based on the
validated circuit design.

III. SIMULATION/HARDWARE ACCELERATION MODES

A high level diagram of the software compilation and hardware configuration during
compile time and run time in accordance with one embodiment of the present invention is shown
in FIG. 3. FIG. 3 shows two sets of information: one set of information distinguishes the
operations performed during compile time and simulation/emulation run time; and the other set of
information shows the partitioning between software models and hardware models. At the outset,
the SEmulation system in accordance with one embodiment of the present invention needs the
user circuit design as input data 200. The user circuit design is in some form of HDL file (e.g.,
Verilog, VHDL). The SEmulation system parses the HDL file so that behavior level code,
register transfer level code, and gate level code can be reduced to a form usable by the
SEmulation system. The system generates a source design database for front end processing step
205. The processed HDL file is now usable by the SEmulation system. The parsing process
converts ASCII data to an internal binary data structure and is known to those ordinarily skilled
in the art. Please refer to ALFRED V. AHO, RAVI SETHI, AND JEFFREY D. ULLMAN,
COMPILERS: PRINCIPLES, TECHNIQUES, AND TOOLS (1988), which is incorporated by 
reference herein. 

Compile time is represented by processes 225 and run time is represented by 
processes/elements 230. During compilation time as indicated by process 225, the SEmulation
system compiles the processed HDL file by performing component type analysis. The component
type analysis classifies HDL components into combinational components, register components, 
clock components, memory components, and test-bench components. Essentially, the system 
partitions the user circuit design into control and evaluation components. 

The SEmulation compiler 210 essentially maps the control components of the simulation
into software and the evaluation components into software and hardware. The compiler 210
generates a software model for all HDL components. The software model is cast in code 215.
Additionally, the SEmulation compiler 210 uses the component type information of the HDL file,
selects or generates hardware logic blocks/elements from a library or module generator, and
generates a hardware model for certain HDL components. The end result is a so-called
"bitstream" configuration file 220.

In preparation for run-time, the software model in code form is stored in main memory 
where the application program associated with the SEmulation program in accordance with one
embodiment of the present invention is stored. This code is processed in the general purpose
processor or workstation 240. Substantially concurrently, the configuration file 220 for the
hardware model is used to map the user circuit design into the reconfigurable hardware boards
250. Here, those portions of the circuit design that have been modeled in hardware are mapped 
and partitioned into the FPGA chips in the reconfigurable hardware boards 250. 

As explained above, user test-bench stimulus and test vector data as well as other
test-bench resources 235 are applied to the general purpose processor or workstation 240 for
simulation purposes. Furthermore, the user can perform emulation of the circuit design via
software control. The reconfigurable hardware boards 250 contain the user's emulated circuit
design. This SEmulation system has the ability to let the user selectively switch between software
simulation and hardware emulation, as well as stop either the simulation or emulation process at
any time, cycle-by-cycle, to inspect values from every component in the model, whether register
or combinational. Thus, the SEmulation system passes data between the test-bench 235 and the
processor/workstation 240 for simulation and the test-bench 235 and the reconfigurable hardware
boards 250 via data bus 245 and processor/workstation 240 for emulation. If a user target system
260 is involved, emulation data can pass between the reconfigurable hardware boards 250 and the
target system 260 via the emulation interface 255 and data bus 245. The kernel is found in the
software simulation model in the memory of the processor/workstation 240 so data necessarily 
pass between the processor/workstation 240 and the reconfigurable hardware boards 250 via data 
bus 245. 

FIG. 4 shows a flow chart of the compilation process in accordance with one embodiment
of the present invention. The compilation process is represented as processes 205 and 210 in
FIG. 3. The compilation process in FIG. 4 starts at step 300. Step 301 processes the front end
information. Here, gate level HDL code is generated. The user has converted the initial circuit
design into HDL form by directly handwriting the code or using some form of schematic or
synthesis tool to generate the gate level HDL representations of the code. The SEmulation
system parses the HDL file (in ASCII format) into a binary format so that behavior level code,
register transfer level (RTL) code, and gate level code can be reduced to an internal data structure
form usable by the SEmulation system. The system generates a source design database
containing the parsed HDL code.

Step 302 performs component type analysis by classifying HDL components into
combinational components, register components, clock components, memory components, and
test-bench components as shown in component type resource 303. The SEmulation system
generates hardware models for register and combinational components, with some exceptions as
discussed below. Test-bench and memory components are mapped in software. Some clock
components (e.g., derived clocks) are modeled in hardware and others reside in the
software/hardware boundary (e.g., software clocks). 

Combinational components are stateless logic components whose output values are a 
function of current input values and do not depend on the history of input values. Examples of
combinational components include primitive gates (e.g., AND, OR, XOR, NOT), selectors,
adders, multipliers, shifters, and bus drivers.

Register components are simple storage components. The state transition of a register is 
controlled by a clock signal. One form of register is edge-triggered which may change states 
when an edge is detected. Another form of register is a latch, which is level triggered. 
Examples include flip-flops (D-type, JK-type) and level-sensitive latches.

Clock components are components that deliver periodic signals to logic devices to control
their behavior. Typically, clock signals control the update of registers. Primary clocks are
generated from self-timed test-bench processes. For example, a typical test-bench process for
clock generation in Verilog is as follows:

    always begin
        Clock = 0;
        #5;
        Clock = 1;
        #5;
    end

According to this code, the clock signal is initially at logic "0." After 5 time units, the clock
signal changes to logic "1." After 5 time units, the clock signal reverts back to logic "0."
Usually, the primary clock signals are generated in software and only a few (i.e., 1-10) primary
clocks are found in a typical user circuit design. Derived or gated clocks are generated from a
network of combinational logic and registers that are in turn driven by the primary clocks. Many
(i.e., 1,000 or more) derived clocks are found in a typical user circuit design.

Memory components are block storage components with address and control lines to
access individual data in specific memory locations. Examples include ROM, asynchronous
RAM, and synchronous RAM.

Test-bench components are software processes used to control and monitor the simulation
processes. Accordingly, these components are not part of the hardware circuit design under test.
Test-bench components control the simulation by generating clock signals, initializing simulation
data, and reading simulation test vector patterns from disk/memory. Test-bench components also
monitor the simulation by checking for changes in value, performing value change dump,
checking asserted constraints on signal value relations, writing output test vectors to
disk/memory, and interfacing with various waveform viewers and debuggers.

The SEmulation system performs component type analysis as follows. The system
examines the binary source design database. Based on the source design database, the system can
characterize or classify the elements as one of the above component types. Continuous
assignment statements are classified as combinational components. Gate primitives are either
combinational type or latch form of register type by language definition. Initialization code is
treated as a test-bench of initialization type.

An always process that drives nets without using the nets is a test-bench of driver type.
An always process that reads nets without driving the nets is a test-bench of monitor type. An
always process with delay controls or multiple event controls is a test-bench of general type.
An always process with a single event control and driving a single net can be one of the
following: (1) If the event control is an edge-triggered event, then the process is an edge-triggered
type register component. (2) If a net driven in a process is not defined in all possible execution
paths, then the net is a latch type of register. (3) If a net driven in a process is defined in all
possible execution paths, then the net is a combinational component.

An always process with a single event control but driving multiple nets can be
decomposed into several processes driving each net separately to derive their respective
component types separately. The decomposed processes can then be used to determine
component type.
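
The classification rules for always processes described above can be sketched in Python. This is a minimal illustration, not the SEmulation system's implementation: the `AlwaysProcess` record, its field names, the returned labels, and the order in which the rules are tested are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class AlwaysProcess:
    # Hypothetical summary of one parsed always process.
    driven_nets: tuple          # nets assigned by the process
    read_nets: tuple            # nets read by the process
    has_delay_control: bool     # contains "#n" delay controls
    event_controls: int         # number of "@(...)" event controls
    edge_triggered: bool        # the event control is an edge event
    defined_on_all_paths: bool  # driven net is assigned on every execution path


def classify(p: AlwaysProcess) -> str:
    """Apply the rules in the order they appear in the text (order is assumed)."""
    if p.driven_nets and not p.read_nets:
        return "test-bench (driver)"
    if p.read_nets and not p.driven_nets:
        return "test-bench (monitor)"
    if p.has_delay_control or p.event_controls > 1:
        return "test-bench (general)"
    if p.event_controls == 1 and len(p.driven_nets) > 1:
        # Multiple driven nets: split into one process per net, then reclassify.
        return "decompose per driven net"
    if p.event_controls == 1 and len(p.driven_nets) == 1:
        if p.edge_triggered:
            return "register (edge-triggered)"
        if not p.defined_on_all_paths:
            return "register (latch)"
        return "combinational"
    return "unclassified"


flip_flop = AlwaysProcess(("q",), ("d",), False, 1, True, True)
clock_gen = AlwaysProcess(("Clock",), (), True, 0, False, True)
print(classify(flip_flop))
print(classify(clock_gen))
```

Under these assumptions, the clock generation process shown earlier falls under the driver rule because it assigns `Clock` without reading any net.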

Step 304 generates a software model for all HDL components, regardless of component
type. With the appropriate user interface, the user is capable of simulating the entire circuit
design using the complete software model. Test-bench processes are used to drive the stimulus
input and test vector patterns, control the overall simulation, and monitor the simulation process.

Step 305 performs clock analysis. The clock analysis includes two general steps: (1)
clock extraction and sequential mapping, and (2) clock network analysis. The clock extraction
and sequential mapping step includes mapping the user's register components into the SEmulation
system's hardware register model and then extracting clock signals out of the system's hardware
register components. The clock network analysis step includes determining primary clocks and
derived clocks based on the extracted clock signals, and separating the gated clock network and
gated data network. A more detailed description will be provided with respect to FIG. 16.

Step 306 performs residence selection. The system, in conjunction with the user, selects 
the components for hardware models; that is, of the universe of possible hardware components 
that can be implemented in the hardware model of the user's circuit design, some hardware
components will not be modeled in hardware for a variety of reasons. These reasons include
component types, hardware resource constraints (e.g., floating point operations and large multiply
operations stay in software), simulation and communication overhead (e.g., small bridge logic
between test-bench processes stays in software, and signals that are monitored by test-bench
processes stay in software), and user preferences. For a variety of reasons including performance
and simulation monitoring, the user can force certain components that would otherwise be 
modeled in hardware to stay in software. 
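
The residence selection criteria above can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the function name, the component type strings, and the boolean flags are hypothetical, and a real selection would also weigh hardware resource limits and communication overhead quantitatively.

```python
# Component types that the text says are candidates for hardware models.
HARDWARE_CANDIDATES = {"register", "combinational", "derived clock"}


def resides_in_hardware(component_type,
                        uses_floating_point=False,
                        monitored_by_test_bench=False,
                        user_forces_software=False):
    """Return True if the component would be modeled in hardware (illustrative only)."""
    if component_type not in HARDWARE_CANDIDATES:
        # Test-bench and memory components are mapped in software.
        return False
    if uses_floating_point:
        # Hardware resource constraint: floating point stays in software.
        return False
    if monitored_by_test_bench:
        # Communication overhead: monitored signals stay in software.
        return False
    if user_forces_software:
        # User preference overrides the default residence.
        return False
    return True


print(resides_in_hardware("register"))
print(resides_in_hardware("combinational", uses_floating_point=True))
```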

Step 307 maps the selected hardware models into a reconfigurable hardware emulation 
board. In particular, step 307 takes the netlist and maps the circuit design into specific
FPGA chips. This step involves grouping or clustering logic elements together. The system then
assigns each group to a unique FPGA chip or several groups to a single FPGA chip. The system
may also split groups to assign them to different FPGA chips. In general, the system assigns
groups to FPGA chips. More detailed discussion will be provided below with respect to FIG. 6.
The system places the hardware model components into a mesh of FPGA chips to minimize
inter-chip communication overhead. In one embodiment, the array comprises a 4x4 array of
FPGAs, a PCI interface unit, and a software clock control unit. The array of FPGAs implements a
portion of the user's hardware circuit design, as determined above in steps 302-306 of this software
compilation process. The PCI interface unit allows the reconfigurable hardware emulation model
to communicate with the workstation via the PCI bus. The software clock avoids race conditions
for the various clock signals to the array of FPGAs. Furthermore, step 307 routes the FPGA chips
according to the communication schedule among the hardware models. 

Step 308 inserts the control circuits. These control circuits include the I/O address 
pointers and data bus logic for communicating with the DMA engine to the simulator (discussed 
below with respect to FIGS. 11, 12, and 14), and the evaluation control logic to control hardware 
state transitions and wire multiplexing (discussed below with respect to FIGS. 19 and 20). As
known to those ordinarily skilled in the art, a direct memory access (DMA) unit provides an 
additional data channel between peripherals and main memory in which the peripherals can 
directly access (i.e., read, write) the main memory without the intervention of the CPU. The 
address pointer in each FPGA chip allows data to move between the software model and the
hardware model in light of the bus size limitations. The evaluation control logic is essentially a
finite state machine that ensures that the clock enable inputs to registers are asserted before the
clock and data inputs enter these registers.

Step 309 generates the configuration files for mapping the hardware model to FPGA 
chips. In essence, step 309 assigns circuit design components to specific cells or gate level
components in each chip. Whereas step 307 determines the mapping of hardware model groups 
to specific FPGA chips, step 309 takes this mapping result and generates a configuration file for 
each FPGA chip. 

Step 310 generates the software kernel code. The kernel is a sequence of software code
that controls the overall SEmulation system. The kernel cannot be generated until this point
because portions of the code require updating and evaluating hardware components. Only after
step 309 has the appropriate mapping to hardware models and FPGA chips occurred. More
detailed discussion will be provided below with respect to FIG. 5. The compilation ends at step
311. 

As mentioned above with respect to FIG. 4, the software kernel code is generated in step
310 after the software and hardware models have been determined. The kernel is a piece of
software in the SEmulation system that controls the operation of the overall system. The kernel
controls the execution of the software simulation as well as the hardware emulation. Because the
kernel also resides in the center of the hardware model, the simulator is integrated with the
emulator. In contrast to other known co-simulation systems, the SEmulation system in
accordance with one embodiment of the present invention does not require the simulator to
interact with the emulator from the outside. One embodiment of the kernel is a control loop
shown in FIG. 5.

Referring to FIG. 5, the kernel begins at step 330. Step 331 evaluates the initialization 
code. Beginning at step 332 and bounded by the decision step 339, the control loop begins and
cycles repeatedly until the system observes no active test-bench processes, in which case the 
simulation or emulation session has completed. Step 332 evaluates the active test-bench 
components for the simulation or emulation. 

Step 333 evaluates clock components. These clock components are from the test-bench
process. Usually, the user dictates what type of clock signal will be generated to the simulation
system. In one example (discussed above with respect to component type analysis and
reproduced here), a clock component as designed by a user in the test-bench process is as
follows:

    always begin
        Clock = 0;
        #5;
        Clock = 1;
        #5;
    end

The user has decided, in this clock component example, that a logic "0" signal will be
generated first, and then after 5 simulation time units, a logic "1" signal will be generated. This
clock generation process will cycle continuously until stopped by the user. These simulation
times are advanced by the kernel.

Decision step 334 inquires whether any active clock edge is detected, which would result
in some kind of logic evaluation in the software and possibly the hardware model (if emulation is
running). The clock signal, which the kernel uses to detect an active clock edge, is the clock
signal from the test-bench process. If the decision step 334 evaluates to "NO," then the kernel
proceeds to step 337. If the decision step 334 evaluates to "YES," then step 335 updates
registers and memories, and step 336 propagates combinational components. Step 336
essentially takes care of combinational logic which needs some time to propagate values through
the combinational logic network after a clock signal has been asserted. Once the values have
propagated through the combinational components and stabilized, the kernel proceeds to step 337.

Note that registers and combinational components are also modeled in hardware and thus,
the kernel controls the emulator portion of the SEmulation system. Indeed, the kernel can
accelerate the evaluation of the hardware model in steps 334 and 335 whenever any active clock
edge is detected. Hence, unlike the prior art, the SEmulation system in accordance with one
embodiment of the present invention can accelerate the hardware emulator through the software
kernel and based on component type (e.g., register, combinational). Furthermore, the kernel
controls the execution of the software and hardware model cycle by cycle. In essence, the
emulator hardware model can be characterized as a simulation coprocessor to the general-purpose
processor running the simulation kernel. The coprocessor speeds up the simulation task.

Step 337 evaluates active test-bench components. Step 338 advances the simulation time.
Step 339 provides the boundary for the control loop that begins at step 332. Step 339 determines
whether any test-bench processes are active. If so, the simulation and/or emulation is still
running and more data should be evaluated. Thus, the kernel loops to step 332 to evaluate any
active test-bench components. If no test-bench processes are active, then the simulation and
emulation processes have completed. Step 340 ends the simulation/emulation process. In sum,
the kernel is the main control loop that controls the operation of the overall SEmulation system.
So long as any test-bench processes are active, the kernel evaluates active test-bench components,
evaluates clock components, detects clock edges to update registers and memories as well as
propagate combinational logic data, and advances the simulation time.
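
The control loop of FIG. 5 can be sketched in Python as follows. This is a hedged illustration rather than the actual kernel: `ToggleClock`, `run_kernel`, and the callback parameters are hypothetical names, the test-bench evaluation of steps 332 and 337 is omitted, and treating the rising edge as the active edge is an assumption.

```python
class ToggleClock:
    """Hypothetical test-bench clock: logic 0 for half_period units, then logic 1."""

    def __init__(self, half_period=5):
        self.half_period = half_period
        self.value = 0

    def rising_edge(self, now):
        # Step 333: evaluate the clock; step 334: report an active (rising) edge.
        previous = self.value
        self.value = (now // self.half_period) % 2
        return previous == 0 and self.value == 1


def run_kernel(clock, update_registers, propagate, test_bench_active):
    """Run the loop until no test-bench process is active; return the final time."""
    now = 0                               # step 331: initialization
    while True:
        # Step 332 (evaluate active test-bench components) is omitted here.
        if clock.rising_edge(now):
            update_registers(now)         # step 335: update registers and memories
            propagate(now)                # step 336: propagate combinational logic
        # Step 337 (evaluate active test-bench components) is omitted here.
        now += 1                          # step 338: advance simulation time
        if not test_bench_active(now):    # step 339: any test-bench process active?
            break
    return now                            # step 340: end


updates = []
end_time = run_kernel(ToggleClock(5), updates.append, lambda t: None, lambda t: t < 20)
print(updates, end_time)
```

With a half period of 5 time units and a test-bench that stays active for 20 time units, the register update callback fires at times 5 and 15 in this sketch.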

FIG. 6 shows one embodiment of a method of automatically mapping hardware models to
reconfigurable boards. A netlist file provides the input to the hardware implementation process.
The netlist describes logic functions and their interconnections. The hardware model-to-FPGA
implementation process includes three independent tasks: mapping, placement, and routing. The
tools are generally referred to as "place-and-route" tools. The design tool used may be Viewlogic
Viewdraw, a schematic capture system, and Xilinx Xact place and route software, or Altera's
MAX+PLUS II system.

The mapping task partitions the circuit design into the logic blocks, I/O blocks, and other 
FPGA resources. Although some logic functions such as flip-flops and buffers may map directly
into the corresponding FPGA resource, other logic functions such as combinational logic must be
implemented in logic blocks using mapping algorithms. The user can usually select mapping for
optimal density or optimal performance.

The placement task involves taking the logic and I/O blocks from the mapping task and 
assigning them to physical locations within the FPGA array. Current FPGA tools generally use 
some combination of three techniques: mincut, simulated annealing, and general force-directed
relaxation (GFDR). These techniques essentially determine optimal placement based on various
cost functions which depend on total net length of interconnections or the delay along a set of
critical signal paths, among other variables. The Xilinx XC4000 series FPGA tools use a 
variation of the mincut technique for initial placement followed by a GFDR technique for fine 
improvement in the placement. 

The routing task involves determining the routing paths used to interconnect the various
mapped and placed blocks. One such router, called a maze router, seeks the shortest path
between two points. Since the routing task provides for direct interconnection among the chips, 
the placement of the circuits with respect to the chips is critical. 
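
As a generic illustration of how a maze router finds a shortest path, the following Python sketch performs a breadth-first (wavefront) search on a grid. It is not the router used by any particular FPGA tool; the grid representation, the function name `maze_route`, and the four-neighbor move set are assumptions chosen for the example.

```python
from collections import deque


def maze_route(grid, start, goal):
    """Return a shortest list of (row, col) cells from start to goal, or None.

    grid[r][c] is True where the cell is blocked (hypothetical encoding).
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start to recover the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and not grid[nr][nc] and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None                     # goal is unreachable


blocked = [
    [False, True, False],
    [False, True, False],
    [False, False, False],
]
print(maze_route(blocked, (0, 0), (0, 2)))
```

Because breadth-first search expands cells in order of distance from the start, the first time the goal is reached the recovered path has the minimum number of steps.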

At the outset, the hardware model can be described in either gate netlist 350 or RTL 357. 
The RTL level code can be further synthesized to gate level netlist. During the mapping
process, a synthesizer server 360, such as the Altera MAX+PLUS II programmable logic
development tool system and software, can be used to produce output files for mapping purposes.
The synthesizer server 360 has the ability to match the user's circuit design components to any
standard existing logic elements found in a library 361 (e.g., standard adders or standard
multipliers), generate any parameterized and frequently used logic module 362 (e.g.,
non-standard multiplexers or non-standard adders), and synthesize random logic elements 363 (e.g.,
look-up table-based logic that implements a customized logic function). The synthesizer server
also removes redundant logic and unused logic. The output files essentially synthesize or 
optimize the logic required by the user's circuit design. 

When some or all of the HDL is at the RTL level, the circuit design components are at a
high enough level such that the SEmulation system can easily model these components using
SEmulation registers or components. When some or all of the HDL is at the gate netlist level,
the circuit design components may be more circuit design-specific, making the mapping of user
circuit design components to SEmulation components more difficult. Accordingly, the
synthesizer server is capable of generating any logic element based on variations of standard logic
elements or random logic elements that may not have any parallels in these variations or library 
standard logic elements. 

If the circuit design is in gate netlist form, the SEmulation system will initially perform
the grouping or clustering operation 351. The hardware model construction is based on the
clustering process because the combinational logic and registers are separated from the clock.
Thus, logic elements that share a common primary clock or gated clock signal may be better
served by grouping them together and placing them on a chip together. The clustering algorithm is
based on connectivity driven, hierarchical extraction, and regular structure extraction. If the
description is in structured RTL 358, the SEmulation system can decompose the function into
smaller units as represented by the logic function decomposition operation 359. At any stage, if
logic synthesis or logic optimization is required, a synthesizer server 360 is available to transform
the circuit design to a more efficient representation based on user directives. For the clustering
operation 351, the link to the synthesizer server is represented by dotted arrow 364. For the
structured RTL 358, the link to the synthesizer server 360 is represented by arrow 365. For the
logic function decomposition operation 359, the link to the synthesizer server 360 is represented
by arrow 366. 

The clustering operation 351 groups the logic components together in a selective manner
based on function and size. The clustering may involve only one cluster for a small circuit design
or several clusters for a large circuit design. Regardless, these clusters of logic elements will be
used in later steps to map them into the designated FPGA chips; that is, one cluster will be
targeted for a particular chip and another cluster will be targeted for a different chip or possibly
the same chip as the first cluster. Usually, the logic elements in a cluster will stay together with
the cluster in a chip, but for optimization purposes, a cluster may have to be split up into more
than one chip.

After the clusters are formed in the clustering operation 351, the system performs a
place-and-route operation. Initially, a coarse-grain placement operation 352 of the clusters into the
FPGA chips is performed. The coarse-grain placement operation 352 initially places clusters of
logic elements to selected FPGA chips. If necessary, the system makes the synthesizer server
360 available to the coarse-grain placement operation 352 as represented by arrow 367. A
fine-grain placement operation is performed after the coarse-grain placement operation to fine-tune the
initial placement. The SEmulation system uses a cost function based on pin usage requirements, 
gate usage requirements, and gate-to-gate hops to determine the optimal placement for both the 
coarse-grain and fine-grain placement operations.

The determination of how clusters are placed in certain chips is based on placement cost,
which is calculated through a cost function f(P, G, D) for two or more circuits (i.e., CKTQ =
CKT1, CKT2, . . . , CKTN) and their respective locations in the array of FPGA chips, where P
is generally the pin usage/availability, G is generally the gate usage/availability, and D is the
distance or number of gate-to-gate "hops" as defined by a connectivity matrix M (shown in FIG.
7 in conjunction with FIG. 8). The user's circuit design that is modeled in the hardware model
comprises the total combination of circuits CKTQ. Each cost function is defined such that the
computed values of the calculated placement cost tend to generally promote: (1) a minimum
number of "hops" between any two circuits CKTN-1 and CKTN in the FPGA array, and (2)
placement of circuits CKTN-1 and CKTN in the FPGA array such that pin usage is minimized.

In one embodiment, the cost function f(P, G, D) is defined as:

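\[
f(P, G, D) = \left[ C_0 \cdot \max_{\text{each FPGA chip}} \left( \frac{P_{\text{used}}}{P_{\text{available}}} \right) \right]
           + \left[ C_1 \cdot \max_{\text{each FPGA chip}} \left( \frac{G_{\text{used}}}{G_{\text{available}}} \right) \right]
           + \left[ C_2 \cdot \sum_{(i,j) \in CKT} \mathrm{DIST}(FPGA_i, FPGA_j) \right]
\]

This equation can be simplified to the form:

\[
f(P, G, D) = C_0 \cdot P + C_1 \cdot G + C_2 \cdot D
\]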

The first term (i.e., C0*P) generates a first placement cost value based on the number of
pins used and the number of pins available. The second term (i.e., C1*G) generates a second
placement cost value based on the number of gates used and the number of gates available. The
third term (i.e., C2*D) generates a placement cost value based on the number of hops present
between various interconnecting gates in the circuits CKTQ (i.e., CKT1, CKT2, . . . , CKTN).
The overall placement cost value is generated by iteratively summing these three placement cost
values. Constants C0, C1, and C2 represent weighting constants that selectively skew the overall
placement cost value generated from this cost function toward the factor or factors (i.e., pin
usage, gate usage, or gate-to-gate hops) that is/are most important during any iterative placement
cost calculation. 
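
The three-term cost function can be sketched in Python as shown below. This is an illustrative sketch only: the function name, the dictionary keys, and the `distance` matrix layout are hypothetical, and the sketch assumes that both P and G are taken as the maximum used/available ratio over all FPGA chips and that D is the sum of hop distances over connected chip pairs.

```python
def placement_cost(chips, connections, distance, c0=1.0, c1=1.0, c2=1.0):
    """Evaluate f(P, G, D) = C0*P + C1*G + C2*D for one candidate placement.

    chips:       list of dicts with pins_used, pins_available,
                 gates_used, gates_available (hypothetical keys)
    connections: list of (i, j) chip index pairs that must communicate
    distance:    distance[i][j] is the hop count between chip i and chip j
    """
    # P: worst-case pin usage ratio over all chips.
    p = max(ch["pins_used"] / ch["pins_available"] for ch in chips)
    # G: worst-case gate usage ratio over all chips.
    g = max(ch["gates_used"] / ch["gates_available"] for ch in chips)
    # D: total number of hops over all connected chip pairs.
    d = sum(distance[i][j] for i, j in connections)
    return c0 * p + c1 * g + c2 * d


chips = [
    {"pins_used": 132, "pins_available": 264, "gates_used": 3000, "gates_available": 10000},
    {"pins_used": 66, "pins_available": 264, "gates_used": 8000, "gates_available": 10000},
]
hops = [[0, 1], [1, 0]]
print(placement_cost(chips, [(0, 1), (0, 1)], hops))
```

Raising `c0` and `c1` relative to `c2` makes the pin and gate ratios dominate the result, while raising `c2` makes the hop count dominate.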

The placement cost is calculated repeatedly as the system selects different relative values 
for the weighting constants C0, C1, and C2. Thus, in one embodiment, during the coarse-grain
placement operation, the system selects large values for C0 and C1 relative to C2. In this
iteration, the system determines that optimizing pin usage/availability and gate usage/availability
is more important than optimizing gate-to-gate hops in the initial placement of the circuits
CKTQ in the array of FPGA chips. In a subsequent iteration, the system selects small values for
C0 and C1 relative to C2. In this iteration, the system determines that optimizing gate-to-gate
hops is more important than optimizing pin usage/availability and gate usage/availability.

During the fine-grain placement operation, the system uses the same cost function. In one
embodiment, the iterative steps with respect to the selection of C0, C1, and C2 are the same as
for the coarse-grain operation. In another embodiment, the fine-grain placement operation
involves having the system select small values for C0 and C1 relative to C2.

These variables and equations will now be explained. In determining
whether to place certain circuits CKTQ in FPGA chip x or FPGA chip y (among other FPGA
chips), the cost function examines pin usage/availability (P), gate usage/availability (G), and
gate-to-gate hops (D). Based on the cost function variables, P, G, and D, the cost function f(P, G, D)
generates a placement cost value for placing circuits CKTQ in particular locations in the FPGA
array.



Pin usage/availability P also represents the I/O capacity. P_used is the number of pins used
by the circuits CKTQ for each FPGA chip. P_available is the number of available pins in the FPGA
chip. In one embodiment, P_available is 264 (44 pins x 6 interconnections/chip), while in another
embodiment, P_available is 265 (44 pins x 6 interconnections/chip + 1 extra pin). However, the
specific number of available pins depends on the type of FPGA chip used, the total number of
interconnections used per chip, and the number of pins used for each interconnection. Thus,
P_available can vary considerably. So, to evaluate the first term of the cost function f(P, G, D)
equation (i.e., C0*P), the ratio P_used/P_available is calculated for each FPGA chip. Thus, for a
4x4 array of FPGA chips, sixteen ratios P_used/P_available are calculated. The more pins are used
for a given number of available pins, the higher the ratio. Of the sixteen calculated ratios, the
ratio yielding the highest number is selected. The first placement cost value is calculated from
the first term C0*P by multiplying the selected maximum ratio P_used/P_available with the
weighting constant C0. Because this first term depends on the calculated ratio P_used/P_available
and the particular maximum ratio among the ratios calculated for each FPGA chip, the placement
cost value will be higher for higher pin usage, all other factors being equal. The system selects
the placement yielding the lowest placement cost. The particular placement yielding a maximum
ratio P_used/P_available that is the lowest among all the maximums calculated for various
placements is generally considered as the optimum placement in the FPGA array, all other factors
being equal.
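
The pin term can be illustrated with a minimal Python sketch. It assumes the 264-pin embodiment
described above; the function name `pin_term`, the four-chip pin counts, and the value chosen for
C0 are hypothetical and serve only to show the "lowest of the maximum ratios" selection.

```python
def pin_term(c0, used_pins, available_pins=264):
    """Return C0 * max(P_used / P_available) taken over all chips."""
    worst_ratio = max(used / available_pins for used in used_pins)
    return c0 * worst_ratio


# Hypothetical pin counts for two candidate placements on four chips.
placement_a = [132, 66, 264, 0]     # one chip uses every available pin
placement_b = [132, 132, 132, 66]   # pin usage spread more evenly
C0 = 1.0                            # assumed weighting constant

# The placement with the lowest maximum ratio has the lowest pin cost.
best = min((placement_a, placement_b), key=lambda p: pin_term(C0, p))
```

In this sketch, the more evenly spread placement is preferred because its worst-case chip uses
only half of its available pins.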

The gate usage/availability G is based on the number of gates allowable by each FPGA
chip. In one embodiment, based on the location of the circuits CKTQ in the array, if the number
of gates used G_used in each chip is above a certain threshold, then this second placement cost
(C1*G) will be assigned a value indicating that the placement is not feasible. Analogously, if the
number of gates used in each chip containing circuits CKTQ is at or below a certain threshold,
then this second term (C1*G) will be assigned a value indicating that the placement is feasible.
Thus, if the system initially wants to place circuit CKT1 in a particular chip and that chip does
not have enough gates to accommodate the circuit CKT1, then the system may conclude through
the cost function that this particular placement is infeasible. Generally, the high number (e.g.,
infinity) for G ensures that the cost function will generate a high placement cost value indicating
that the desired placement of the circuits CKTQ is not feasible and that an alternative placement
should be determined.

In another embodiment, based on the location of the circuits CKTQ in the array, the ratio
G_used/G_available is calculated for each chip, where G_used is the number of gates used by the
circuits CKTQ in each FPGA chip, and G_available is the number of gates available in each chip. In
one embodiment, the system uses the FLEX 10K100 chip for the FPGA array. The FLEX 10K100
chip contains approximately 100,000 gates. Thus, in this embodiment, G_available is equal to
100,000 gates. Thus, for a 4x4 array of FPGA chips, sixteen ratios G_used/G_available are
calculated. The more gates are used for a given number of available gates, the higher the ratio.
Of the sixteen calculated ratios, the ratio yielding the highest number is selected. The second
placement cost value is calculated from the second term C1*G by multiplying the selected maximum
ratio G_used/G_available with the weighting constant C1. Because this second term depends on the
calculated ratio G_used/G_available and the particular maximum ratio among the ratios calculated
for each FPGA chip, the placement cost value will be higher for higher gate usage, all other
factors being equal. The system selects the circuit placement yielding the lowest placement cost.
The particular placement yielding a maximum ratio G_used/G_available that is the lowest among all
the maximums calculated for various placements is generally considered as the optimum placement
in the FPGA array, all other factors being equal.

In another embodiment, the system selects some value for C1 initially. If the ratio
G_used/G_available is greater than "1," then this particular placement is infeasible (i.e., at
least one chip does not have enough gates for this particular placement of circuits). As a result,
the system modifies C1 with a very high number (e.g., infinity) and accordingly, the second term
C1*G will also be a very high number and the overall placement cost value f(P, G, D) will also be
very high. If, on the other hand, the ratio G_used/G_available is less than or equal to "1," then
this particular placement is feasible (i.e., each chip has enough gates to support the circuit
implementation). As a result, the system does not modify C1 and accordingly, the second term C1*G
will resolve to a particular number.
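
The feasibility rule for the gate term can be sketched as follows. This is a minimal
illustration that assumes the 100,000-gate embodiment mentioned above; the function name
`gate_term` and the gate counts in the usage lines are hypothetical.

```python
import math


def gate_term(c1, used_gates, available_gates=100_000):
    """Return the second cost term C1*G for one candidate placement.

    G is the largest G_used / G_available ratio over all chips. If any chip
    is over capacity (ratio > 1), C1 is replaced by infinity so the overall
    placement cost marks the placement as infeasible.
    """
    worst_ratio = max(used / available_gates for used in used_gates)
    if worst_ratio > 1:
        c1 = math.inf  # infeasible: at least one chip lacks enough gates
    return c1 * worst_ratio


feasible_cost = gate_term(2.0, [50_000, 100_000])    # every chip fits
infeasible_cost = gate_term(2.0, [100_001, 10])      # first chip overflows
```

Using infinity as the "very high number" means any infeasible placement loses the comparison
against every feasible one, whatever the other terms contribute.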

The third term C2*D represents the number of hops between all gates that require
interconnection. The number of hops also depends on the interconnection matrix. The
connectivity matrix provides the foundation for determining circuit paths between any two gates
that need chip-to-chip interconnection. Not every gate needs the gate-to-gate interconnection.
Based on the user's original circuit design and the partitioning of clusters to certain chips, some
gates will not need any interconnection whatsoever because the logic element(s) connected to their
respective input(s) and output(s) is/are located in the same chip. Other gates, however, need the
interconnections because the logic element(s) connected to their respective input(s) and output(s)
is/are located in different chips.

To understand "hops," refer to the connectivity matrix shown in tabular form in FIG. 7
and in pictorial form in FIG. 8. In FIG. 8, each interconnection between chips, such as
interconnection 602 between chip F11 and chip F14, represents 44 pins or 44 wire lines. In other
embodiments, each interconnection represents more than 44 pins. In still other embodiments,
each interconnection represents fewer than 44 pins.

Using this interconnection scheme, data can pass from one chip to another chip within two
"hops" or "jumps." Thus, data can pass from chip F11 to chip F12 in one hop via
interconnection 601, and data can pass from chip F11 to chip F33 in two hops via either
interconnections 600 and 606, or interconnections 603 and 610. These exemplary hops are the
shortest path hops between these sets of chips. In some instances, signals may be routed through
various chips such that the number of hops between a gate in one chip and a gate in another chip
exceeds the shortest path hop. The only circuit paths that must be examined in determining the
number of gate-to-gate hops are the ones that need the interconnections.

The connectivity is represented by the sum of all hops between the gates that need the
inter-chip interconnections. The shortest path between any two chips can be represented by one
or two "hops" using the connectivity matrix of FIGS. 7 and 8. However, for certain hardware
model implementations, I/O capacity may limit the number of direct shortest path connections
between any two gates in the array and hence, these signals must be routed through longer paths
(and therefore more than two hops) to reach their destinations. Accordingly, the number of hops
may exceed two for some gate-to-gate connections. Generally, all things being equal, a smaller
number of hops results in a smaller placement cost.



This third term is the product of a weighting constant C2 and a summation component (Σ
...). The summation component is essentially the sum of all hops between each gate i and gate j
in the user's circuit design that require chip-to-chip interconnections. As discussed above, not all
gates need inter-chip interconnections. For those gates i and gates j that need inter-chip
interconnections, the number of hops is determined. For all gates i and gates j, the total number
of hops is added together.



The third term (i.e., C2*D) is reproduced in long form as follows:

$$f(P, G, D) = \ldots \left[ C_2 \cdot \sum_{(i,j) \in CKT} DIST(FPGA_i, FPGA_j) \right]$$

The distance calculation can also be defined as:

$$DIST(FPGA_i, FPGA_j) = \min_{k}\left(M^{k}_{i,j}\right) = D, \qquad (i,j) \in CKT$$

Here, M is the connectivity matrix. One embodiment of the connectivity matrix is shown
in FIG. 7. The distance is calculated for each gate-to-gate connection requiring an
interconnection. Thus, for each gate i and gate j comparison, the connectivity matrix M is
examined. More specifically,

A matrix is set up with all chips in the array such that each chip is identifiably numbered.
These identifying numbers are set up at the top of the matrix as a column header. Similarly,
these identifying numbers are set up along the side of the matrix as a row header. A particular
entry at the intersection of a row and column in this matrix provides the direct connectivity data
between the chip identified by the row and the chip identified by the column at which the
intersections occur. For any distance calculation between chip i and chip j, an entry in the matrix
M_{i,j} contains either a "1" for a direct connection or "0" for no direct connection. The index k
refers to the number of hops necessary to interconnect any gate in chip i to any gate in chip j
requiring the interconnections.

Initially, the connectivity matrix M_{i,j} for k=1 should be examined. If the entry is "1," a
direct connection exists for this gate in chip i to the selected gate in chip j. Thus, the index or
hop k=1 is designated as the result of M_{i,j} and this result is the distance between these two
gates. At this point, another gate-to-gate connection can be examined. However, if the entry is
"0," then no direct connection exists.

If no direct connection exists, the next k should be examined. This new k (i.e., k=2) can
be computed by multiplying matrix M_{i,j} with itself; in other words, M^2 = M*M, where k=2.
This process of multiplying M by itself continues until the calculated result at the particular
row and column entry for chip i and chip j is "1," at which point the index k is selected as the
number of hops. The operation includes ANDing matrices M together and then ORing the
ANDed results. If the AND operation between matrix m_{i,l} and m_{l,j} results in a logic "1"
value, then a connection exists between a selected gate in chip i and a selected gate in chip j
through any chip l within hop k; if not, no connection exists within this particular hop k and
further calculation is necessary. The matrices m_{i,l} and m_{l,j} are the connectivity matrix M as
defined for this hardware modeling. For any given gate i and gate j requiring the
interconnections, the row containing the FPGA chip for gate i in matrix m_{i,l} is logically ANDed
to the column containing the FPGA chip for gate j in m_{l,j}. The individual ANDed components are
ORed to determine if the resulting M_{i,j} value for index or hop k is a "1" or "0." If the result
is a "1," then a connection exists and the index k is designated as the number of hops. If the
result is "0," then no connection exists.
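
The AND/OR matrix procedure described above can be sketched in Python. This is a minimal
illustration, not the implementation of any particular embodiment: the function names, the
`max_hops` cutoff, and the three-chip example matrix (a simple chain of chips) are assumptions
made for the example.

```python
def bool_matmul(a, b):
    """Boolean matrix product: AND row i with column j, then OR the results."""
    n = len(a)
    return [[int(any(a[i][l] and b[l][j] for l in range(n)))
             for j in range(n)]
            for i in range(n)]


def hop_distance(m, i, j, max_hops=None):
    """Smallest k for which entry (i, j) of the k-th Boolean power of m is 1.

    Returns None if no connection is found within max_hops (assumed to
    default to the number of chips).
    """
    limit = max_hops if max_hops is not None else len(m)
    power = m                            # k = 1: direct connections
    for k in range(1, limit + 1):
        if power[i][j]:
            return k
        power = bool_matmul(power, m)    # next k: multiply by M again
    return None


# Hypothetical connectivity matrix for three chips wired in a chain:
# chip 0 <-> chip 1 <-> chip 2, with no direct link between chips 0 and 2.
M = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
```

With this matrix, `hop_distance(M, 0, 1)` finds a direct connection at k=1, while
`hop_distance(M, 0, 2)` requires one multiplication of M by itself and resolves at k=2.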

The following example illustrates these principles. Refer to FIGS. 35(A) to 35(D). FIG.
35(A) shows a user's circuit design represented as a cloud 1090. This circuit design 1090 may be
simple or complex. A portion of the circuit design 1090 includes an OR gate 1091 and two AND
gates 1092 and 1093. The outputs of AND gates 1092 and 1093 are coupled to the inputs of OR
gate 1091. These gates 1091, 1092, and 1093 may also be coupled to other portions of the circuit
design 1090.

Referring to FIG. 35(B), the components of this circuit 1090, including the portion
containing the three gates 1091, 1092, and 1093, may be configured and placed in FPGA chips
1094, 1095, and 1096. This particular exemplary array of FPGA chips has the interconnection
scheme as shown; that is, a set of interconnections 1097 couples chip 1094 to chip 1095, and
another set of interconnections 1098 couples chip 1095 to chip 1096. No direct interconnections
are provided between chip 1094 and chip 1096. When placing the components of this circuit
design 1090 into chips, the system uses the pre-designed interconnection scheme to connect
circuit paths across different chips.

Referring to FIG. 35(C), one possible configuration and placement is OR gate 1091 placed
in chip 1094, AND gate 1092 placed in chip 1095, and AND gate 1093 placed in chip 1096.
Other portions of the circuit 1090 are not shown for pedagogic purposes. The connection
between OR gate 1091 and AND gate 1092 requires an interconnection because they are located
in different chips so the set of interconnections 1097 is used. The number of hops for this
interconnection is "1." The connection between OR gate 1091 and AND gate 1093 also requires
interconnections so sets of interconnections 1097 and 1098 are used. The number of hops is "2."
For this placement example, the total number of hops is "3," discounting the contribution from
other gates and their interconnections in the remainder of circuit 1090 that are not shown.

FIG. 35(D) shows another placement example. Here, OR gate 1091 is placed in chip
1094, and AND gates 1092 and 1093 are placed in chip 1095. Again, other portions of the
circuit 1090 are not shown for pedagogic purposes. The connection between OR gate 1091 and
AND gate 1092 requires an interconnection because they are located in different chips so the set
of interconnections 1097 is used. The number of hops for this interconnection is "1." The
connection between OR gate 1091 and AND gate 1093 also requires interconnections so the set of
interconnections 1097 is used. The number of hops is also "1." For this placement example, the
total number of hops is "2," discounting the contribution from other gates and their
interconnections in the remainder of circuit 1090 that are not shown. So, on the basis of the
distance D parameter only and assuming all other factors are equal, the cost function calculates a
lower placement cost for the placement example of FIG. 35(D) than the placement example of FIG.
35(C). However, all other factors are not equal. More than likely, the cost function for FIG.
35(D) is also based on the gate usage/availability G. In FIG. 35(D), one more gate is used in
chip 1095 than that used in the same chip in FIG. 35(C). Furthermore, the pin usage/availability
P for chip 1095 in the placement example illustrated in FIG. 35(C) is greater than the pin
usage/availability for the same chip in the other placement example illustrated in FIG. 35(D).
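
The hop totals of the two placement examples can be checked with a short Python sketch. It
models only the three chips and three gates of FIGS. 35(B) to 35(D); the breadth-first search
used to count hops is a substitute chosen for brevity rather than the matrix procedure described
earlier, and the gate labels and function names are hypothetical.

```python
from collections import deque

# Chip adjacency of FIG. 35(B): 1094 <-> 1095 <-> 1096, no 1094 <-> 1096 link.
LINKS = {1094: [1095], 1095: [1094, 1096], 1096: [1095]}


def hops(src, dst):
    """Shortest hop count between two chips, found by breadth-first search."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        chip, dist = queue.popleft()
        if chip == dst:
            return dist
        for nxt in LINKS[chip]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("chips are not connected")


def total_hops(placement, nets):
    """Sum hops over the gate pairs whose gates sit in different chips."""
    return sum(hops(placement[a], placement[b])
               for a, b in nets if placement[a] != placement[b])


# Each AND gate output drives an input of OR gate 1091.
NETS = [("AND1092", "OR1091"), ("AND1093", "OR1091")]
PLACEMENT_C = {"OR1091": 1094, "AND1092": 1095, "AND1093": 1096}  # FIG. 35(C)
PLACEMENT_D = {"OR1091": 1094, "AND1092": 1095, "AND1093": 1095}  # FIG. 35(D)
```

Here `total_hops(PLACEMENT_C, NETS)` evaluates to 3 and `total_hops(PLACEMENT_D, NETS)` to 2,
matching the totals stated for FIG. 35(C) and FIG. 35(D).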

After the coarse-grain placement, a fine tuning of the placement of the flattened clusters
will further optimize the placement result. This fine-grain placement operation 353 refines the
placement initially selected by the coarse-grain placement operation 352. Here, initial clusters
may be split up if such an arrangement will increase the optimization. For example, assume logic
elements X and Y are originally part of cluster A and designated for FPGA chip 1. Due to the
fine-grain placement operation 353, logic elements X and Y may now be designated as a separate
cluster B or made part of another cluster C and designated for placement in FPGA chip 2. An
FPGA netlist 354, which ties the user's circuit design to specific FPGAs, is then generated.

The determination of how clusters are split up and placed in certain chips is also based on
placement cost, which is calculated through a cost function f(P, G, D) for circuits CKTQ. In one
embodiment, the cost function used for the fine-grain placement process is the same as the cost
function used for the coarse-grain placement process. The only difference between the two
placement processes is the size of the clusters placed, not the processes themselves. The
coarse-grain placement process uses larger clusters than the fine-grain placement process. In
other embodiments, the cost functions for the coarse-grain and fine-grain placement processes are
different from each other, as described above with respect to selecting weighting constants C0,
C1, and C2.

Once the placement is complete, a routing task 355 among the chips is performed. If the
number of routing wires to connect circuits located in different chips exceeds the available pins in
these FPGA chips allocated for the circuit-to-circuit routing, time division multiplex (TDM)
circuits can be used. For example, if each FPGA chip allows only 44 pins for connecting circuits
located in two different FPGA chips, and a particular model implementation requires 45 wires
between chips, a special time division multiplex circuit will also be implemented in each chip.
This special TDM circuit couples at least two of the wires together. One embodiment of the
TDM circuit is shown in FIGS. 9(A), 9(B), and 9(C), which will be discussed later. Thus, the
routing task can always be completed because the pins can be arranged into time division
multiplex form among the chips.

Once the placement and routing of each FPGA is determined, each FPGA can be
configured into optimized and working circuits and accordingly, the system generates a
"bitstream" configuration file 356. In Altera terminology, the system generates one or more
Programmer Object Files (.pof). Other generated files include SRAM Object Files (.sof),
JEDEC Files (.jed), Hexadecimal (Intel-format) Files (.hex), and Tabular Text Files (.ttf). The
Altera MAX+PLUS II Programmer uses POFs, SOFs, and JEDEC Files along with Altera
hardware programmable devices to program the FPGA array. Alternatively, the system generates
one or more raw binary files (.rbf). The CPU revises .rbf files and programs the FPGA
array through the PCI bus.

At this point, the configured hardware is ready for hardware start-up 370. This completes 
the automatic construction of hardware models on the reconfigurable boards. 

Returning to the TDM circuit that allows groups of pin outputs to be time-multiplexed
together so that only one pin output is actually used, the TDM circuit is essentially a multiplexer
with at least two inputs (for the two wires), one output, and a couple of registers configured in a
loop as the selector signal. If the SEmulation system requires more wires to be grouped together,
then more inputs and loop registers can be provided. As the selector signal to this TDM circuit,
several registers configured in a loop provide the appropriate signals to the multiplexer so that at
one time period, one of the inputs is selected as the output, and at another time period, another
input is selected as the output. Thus, the TDM circuit manages to use only one output wire
between chips so that, for this example, the hardware model of the circuit implemented in a
particular chip can be accomplished using 44 pins, instead of 45 pins. Thus, the routing task can
always be completed because the pins can be arranged into time division multiplex form among
the chips.

FIG. 9(A) shows an overview of the pin-out problem. Since this requires the TDM
circuit, FIG. 9(B) provides a TDM circuit for the transmission side, and FIG. 9(C) provides a
TDM circuit for the receiver side. These figures show only one particular example in which the
SEmulation system requires one wire instead of two wires between chips. If more than two wires
must be coupled together in a time multiplexed arrangement, one ordinarily skilled in the art can
make the appropriate modifications in light of the teachings below.

FIG. 9(A) shows one embodiment of the TDM circuit in which the SEmulation system
couples two wires in a TDM configuration. Two chips, 990 and 991, are provided. A circuit
960 which is a portion of a complete user circuit design is modeled and placed in chip 991. A
circuit 973 which is a portion of a complete user circuit design is modeled and placed in chip 990.
Several interconnections, including a group of interconnections 994, interconnection 992, and
interconnection 993, are provided between circuit 960 and circuit 973. The number of
interconnections, in this example, totals 45. If, in one embodiment, each chip provides only 44
pins at most for these interconnections, one embodiment of the present invention provides for at
least two of the interconnections to be time multiplexed to require only one interconnection
between these chips 990 and 991.

In this example, the group of interconnections 994 will continue to use the 43 pins. For
the 44th and last pin, a TDM circuit in accordance with one embodiment of the present invention
can be used to couple interconnections 992 and 993 together in time division multiplexed form.
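
The pin arithmetic of this example can be sketched as follows. The sketch assumes that exactly
one pin is shared through a TDM circuit and that every remaining pin carries a single wire,
which matches the 45-wire, 44-pin case above; the function name `tdm_plan` and its return layout
are hypothetical.

```python
def tdm_plan(wires_needed, pins_available):
    """Split wires between direct pins and one time-multiplexed pin.

    Assumes a single shared TDM pin; all other pins carry one wire each.
    """
    if wires_needed <= pins_available:
        # Enough pins: every wire gets its own pin, no TDM circuit needed.
        return {"direct_pins": wires_needed, "tdm_pins": 0, "wires_on_tdm_pin": 0}
    direct = pins_available - 1          # reserve the last pin for TDM
    return {"direct_pins": direct,
            "tdm_pins": 1,
            "wires_on_tdm_pin": wires_needed - direct}


plan = tdm_plan(45, 44)   # 43 direct pins, 2 wires sharing the 44th pin
```

A larger shortfall would place more wires on the shared pin, which corresponds to adding more
multiplexer inputs and loop registers as noted earlier.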

FIG. 9(B) shows one embodiment of the TDM circuit. A modeled circuit (or a portion
thereof) 960 within an FPGA chip 991 provides two signals on wires 966 and 967. To the circuit
960, these wires 966 and 967 are outputs. These outputs would normally be coupled to modeled
circuit 973 in chip 990 (see FIGS. 9(A) and 9(C)). However, the availability of only one pin for
these two output wires 966 and 967 precludes a direct pin-for-pin connection. Because the
outputs 966 and 967 are uni-directionally transmitted to the other chip, appropriate transmission
and receiver TDM circuits must be provided to couple these lines together. One embodiment of
the transmission side TDM circuit is shown in FIG. 9(B).

The transmission side TDM circuit includes AND gates 961 and 962, whose respective
outputs 970 and 971 are coupled to the inputs of OR gate 963. The output 972 of OR gate 963 is
the output of the chip assigned to a pin and connected to another chip 990. One set of inputs 966
and 967 to AND gates 961 and 962, respectively, is provided by the circuit model 960. The
other set of inputs 968 and 969 is provided by a looped register scheme which functions as the
time division multiplexed selector signal.

The looped register scheme includes registers 964 and 965. The output 995 of register
964 is provided to the input of register 965 and the input 968 of AND gate 961. The output 996
of register 965 is coupled to the input of register 964 and the input 969 to AND gate 962. Each
register 964 and 965 is controlled by a common clock source. At any given instant in time, only
one of the outputs 995 or 996 provides a logic "1." The other is at logic "0." Thus, after each
clock edge, the logic "1" shifts between output 995 and output 996. This in turn provides
a "1" to either AND gate 961 or AND gate 962, "selecting" either the signal on wire 966 or wire
967. Thus, the data on wire 972 is from circuit 960 on either wire 966 or wire 967.
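
The behavior of the transmission side can be modeled cycle by cycle in a few lines of Python.
This is a behavioral sketch only: the variable names mirror the reference numerals above, while
the initial register state (output 995 at "1") and the one-sample-per-clock-period abstraction
are assumptions.

```python
def tdm_transmit(samples_966, samples_967):
    """Model AND gates 961/962, OR gate 963, and looped registers 964/965.

    Each list element is the value on the wire during one clock period.
    Returns the sequence of values seen on output wire 972.
    """
    sel_995, sel_996 = 1, 0          # assumed initial one-hot register state
    wire_972 = []
    for a, b in zip(samples_966, samples_967):
        out_970 = a & sel_995        # AND gate 961
        out_971 = b & sel_996        # AND gate 962
        wire_972.append(out_970 | out_971)   # OR gate 963
        # Clock edge: the logic "1" shifts to the other register.
        sel_995, sel_996 = sel_996, sel_995
    return wire_972


stream = tdm_transmit([1, 1, 0, 0], [0, 0, 1, 1])
```

In this run, wire 972 carries the value of wire 966 during even-numbered periods and the value
of wire 967 during odd-numbered periods.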

One embodiment of the receiver side portion of the TDM circuit is shown in FIG. 9(C).
The signals from circuit 960 on wires 966 and wire 967 in chip 991 (FIGS. 9(A) and 9(B)) must
be coupled to the appropriate wires 985 or 986 to the circuit 973 in FIG. 9(C). The time division
multiplexed signals from chip 991 enter from wire/pin 978. The receiver side TDM circuit can
couple these signals on wire/pin 978 to the appropriate wires 985 and 986 to circuit 973.

The TDM circuit includes input registers 974 and 975. The signals on wire/pin 978 are
provided to these input registers 974 and 975 via wires 979 and 980, respectively. The output
985 of input register 974 is provided to the appropriate port in circuit 973. Similarly, the output
986 of input register 975 is provided to the appropriate port in circuit 973. These input registers
974 and 975 are controlled by looped registers 976 and 977.

The output 984 of register 976 is coupled to the input of register 977 and the clock input
981 of register 974. The output 983 of register 977 is coupled to the input of register 976 and
the clock input 982 of register 975. Each register 976 and 977 is controlled by a common clock
source. At any given instant in time, only one of the enable inputs 981 or 982 is a logic "1."
The other is at logic "0." Thus, after each clock edge, the logic "1" shifts between enable input
981 and enable input 982. This in turn "selects" either the signal on wire 979 or wire 980. Thus,
the data on wire 978 from circuit 960 is appropriately coupled to circuit 973 via either wire 985
or wire 986.
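
The receiver side can be modeled in the same behavioral style. This sketch assumes that the
looped registers 976 and 977 start in phase with the transmitter's selector (enable input 981
at "1" in the first period) and that both input registers initially hold "0"; the source does
not describe how that alignment is established.

```python
def tdm_receive(stream_978):
    """Model input registers 974/975 gated by looped registers 976/977.

    stream_978 holds the value on wire/pin 978 for each clock period.
    Returns the (output 985, output 986) pair after each period.
    """
    en_981, en_982 = 1, 0            # assumed initial one-hot enable state
    out_985, out_986 = 0, 0          # assumed initial register contents
    history = []
    for bit in stream_978:
        if en_981:
            out_985 = bit            # register 974 captures wire 979
        if en_982:
            out_986 = bit            # register 975 captures wire 980
        history.append((out_985, out_986))
        # Clock edge: the logic "1" shifts to the other enable input.
        en_981, en_982 = en_982, en_981
    return history


received = tdm_receive([1, 0, 0, 1])
```

Each output register holds its last captured value while the other register is being updated,
so circuit 973 sees both signals continuously even though only one pin is used.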

The address pointer in accordance with one embodiment of the present invention, as
discussed briefly with respect to FIG. 4, will now be discussed in greater detail with respect to
FIG. 10. To reiterate, several address pointers are located in each FPGA chip in the hardware
model. Generally, the primary purpose for implementing the address pointers is to enable the
system to deliver data between the software model 315 and the specific FPGA chip in the
hardware model 325 via the 32-bit PCI bus 328 (refer to FIG. 10). More specifically, the
primary purpose of the address pointer is to selectively control the data delivery between each of
the address spaces (i.e., REG, S2H, H2S, and CLK) in the software/hardware boundary and each
FPGA chip among the banks 326a-326d of FPGA chips in light of the bandwidth limitations of
the 32-bit PCI bus. Even if a 64-bit PCI bus is implemented, these address pointers are still
needed to control the data delivery. Thus, if the software model has 5 address spaces (i.e., REG
read, REG write, S2H read, H2S write, and CLK write), each FPGA chip has 5 address pointers
corresponding to these 5 address spaces. Each FPGA needs these 5 address pointers because the
particular selected word in the selected address space being processed may reside in any one or
more of the FPGA chips.

The FPGA I/O controller 381 selects the particular address space (i.e., REG, S2H, H2S,
and CLK) corresponding to the software/hardware boundary by using a SPACE index. Once the
address space is selected, the particular address pointer corresponding to the selected address
space in each FPGA chip selects the particular word corresponding to the same word in the
selected address space. The maximum sizes of the address spaces in the software/hardware
boundary and the address pointers in each FPGA chip depend on the memory/word capacity of
the selected FPGA chip. For example, one embodiment of the present invention uses the Altera
FLEX 10K family of FPGA chips. Accordingly, estimated maximum sizes for each address
space are: REG, 3,000 words; CLK, 1 word; S2H, 10 words; and H2S, 10 words. Each FPGA
chip is capable of holding approximately 100 words.

The SEmulator system also has the feature of allowing the user to start, stop, assert input
values, and inspect values at any time in the SEmulation process. To provide the flexibility of a
simulator, the SEmulator must also make all the components visible to the user regardless of
whether the internal realization of a component is in software or hardware. In software,
combinational components are modeled and values are computed during the simulation process.
Thus, these values are clearly "visible" for the user to access at any time during the simulation
process.

However, combinational component values in the hardware model are not so directly
"visible." Although registers are readily and directly accessible (i.e., read/write) by the software
kernel, combinational components are more difficult to determine. In FPGAs, most
combinational components are modeled as look-up tables in order to achieve high gate utilization.
As a result, the look-up table mapping provides efficient hardware modeling but loses visibility
of most of the combinational logic signals.

Despite these problems with lack of visibility of combinational components, the
SEmulation system can rebuild or regenerate combinational components for inspection by the user
after the hardware acceleration mode. If a user's circuit design has only combinational and
register components, the values of all the combinational components can be derived from the
register components. That is, combinational components are constructed from or contain
registers in various arrangements in accordance with the specific logic function required by the
circuit design. The SEmulator has hardware models of register and combinational components
only, and as a result, the SEmulator will read all the register values from the hardware model and
then rebuild or regenerate all the combinational components. Because of the overhead required to
perform this regeneration process, combinational component regeneration is not performed all the
time; rather, it is performed only upon request by the user. Indeed, one of the benefits of using
the hardware model is to accelerate the simulation process. Determining combinational
component values at every cycle (or even most cycles) further decreases the speed of simulation.
In any event, inspection of register values alone should be sufficient for most simulation analyses.

The process of regenerating combinational component values from register values assumes
that the SEmulation system was in the hardware acceleration mode or ICE mode. Otherwise,
software simulation already provides combinational component values to the user. The
SEmulation system maintains combinational component values as well as register values that were
resident in the software model prior to the onset of hardware acceleration. These values remain
in the software model until further over-writing action by the system. Because the software
model already has register values and combinational component values from the time period
immediately before the onset of the hardware acceleration run, the combinational component
regeneration process involves updating some or all of these values in the software model in
response to updated input register values.

The combinational component regeneration process is as follows: First, if requested by
the user, the software kernel reads all the output values of the hardware register components from
the FPGA chips into the REG buffer. This process involves a DMA transfer of register values in
the FPGA chips via the chain of address pointers to the REG address space. Placing register
values that were in the hardware model into the REG buffer, which is in the software/hardware
boundary, allows the software model to access data for further processing.

Second, the software kernel compares the register values before the hardware acceleration
run and after the hardware acceleration run. If the register values before the hardware
acceleration run are the same as the values after the hardware acceleration run, the values in the
combinational components have not changed. Instead of expending time and resources to
regenerate combinational components, these values can be read from the software model, which
already has combinational component values stored therein from the time immediately before the
hardware acceleration run. On the other hand, if one or more of these register values have
changed, one or more combinational components that depend on the changed register values may
also change values. These combinational components must be regenerated through the following
third step.

Third, for registers with different values from the before-acceleration and after-acceleration
comparison, the software kernel schedules their fan-out combinational components
into the event queue. Here, those registers that changed values during this acceleration run have
detected an event. More than likely, these combinational components that depend on these
changed register values will produce different values. Regardless of any change in value in these
combinational components, the system ensures that these combinational components evaluate
these changed register values in the next step.

Fourth, the software kernel then executes the standard event simulation algorithms to
propagate the value changes from the registers to all the combinational components in the
software model. In other words, the register values that changed during the before-acceleration
to after-acceleration time interval are propagated to all combinational components downstream
that depend on these register values. These combinational components then evaluate these new
register values. In accordance with fan-out and propagation principles, other second-level
combinational components that are located downstream from the first-level combinational
components that in turn directly rely on the changed register values must also evaluate the
changed data, if any. This process of propagating register values to other components
downstream that may be affected continues to the end of the fan-out network. Thus, only those
combinational components located downstream and affected by the changed register values are
updated in the software model. Not all combinational component values are affected. Thus, if
only one register value changed during the before-acceleration to after-acceleration time interval,
and only one combinational component is affected by this register value change, then only this
combinational component will re-evaluate its value in light of this changed register value. Other
portions of the modeled circuit will be unaffected. For this small change, the combinational
component regeneration process will occur relatively fast.

Finally, when event propagation has completed, the system is ready for any mode of
operation. Usually, the user desires to inspect values after a long run. After the combinational
component regeneration process, the user will continue with pure software simulation for
debug/test purposes. However, at other times, the user may wish to continue with the hardware
acceleration to the next desired point. In still other cases, the user may wish to proceed further
with ICE mode.

In sum, combinational component regeneration involves using register values to update
combinational component values in the software model. When any register value has changed,
the changed register value will be propagated through that register's fan-out network as values are
updated. When no register value has changed, the values in the software model also will not
change, so the system does not need to regenerate combinational components. Usually, the
hardware acceleration run will occur for some time. As a result, many register values may
change, affecting many combinational component values located downstream in the fan-out
network of these registers that have the changed values. In this case, the combinational
component regeneration process may be relatively slow. In other cases, after a hardware
acceleration run, only a few register values may change. The fan-out network for registers that
had the changed register values may be small and thus, the combinational component regeneration
process may be relatively fast.
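
The four regeneration steps described above can be illustrated with a minimal Python sketch. It is not the kernel's actual implementation; the function name regenerate, the dictionary-based netlist (values, fanout, evaluate), and the simple FIFO event queue are all assumptions made for illustration only.

```python
from collections import deque


def regenerate(values, fanout, evaluate, regs_before, regs_after):
    """Update combinational values in `values` after an acceleration run.

    values:    dict mapping each node name to its value in the software model
    fanout:    dict mapping a node name to the components it drives
    evaluate:  dict mapping a component name to a function of `values`
    Returns the list of components that were re-evaluated (hypothetical API).
    """
    # Step 2: compare register values before and after the acceleration run.
    changed = [r for r in regs_after if regs_after[r] != regs_before[r]]
    if not changed:
        return []  # nothing to regenerate; the stored values are still valid

    # Step 3: schedule the fan-out components of every changed register.
    queue = deque()
    for reg in changed:
        values[reg] = regs_after[reg]
        queue.extend(fanout.get(reg, []))

    # Step 4: propagate; downstream components are scheduled only on a change.
    evaluated = []
    while queue:
        comp = queue.popleft()
        new_value = evaluate[comp](values)
        evaluated.append(comp)
        if new_value != values.get(comp):
            values[comp] = new_value
            queue.extend(fanout.get(comp, []))
    return evaluated
```

In this sketch, a run in which no register changes returns immediately, and a run in which one register changes touches only that register's fan-out network, which mirrors the fast and slow cases discussed above.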

IV. EMULATION WITH TARGET SYSTEM MODE 

FIG. 10 shows a SEmulation system architecture in accordance with one embodiment of
the present invention. FIG. 10 also shows a relationship between the software model, hardware
model, the emulation interface, and the target system when the system is operating in in-circuit
emulation mode. As described earlier, the SEmulation system comprises a general purpose
microprocessor and a reconfigurable hardware board interconnected by a high-speed bus, such as
a PCI bus. The SEmulation system compiles the user's circuit design and generates the emulation
hardware configuration data for the hardware model-to-reconfigurable board mapping process.
The user can then simulate the circuit through the general purpose processor, hardware accelerate
the simulation process, emulate the circuit design with the target system through the emulation
interface, and later perform post-simulation analysis.

The software model 315 and hardware model 325 are determined during the compilation
process. The emulation interface 382 and the target system 387 are also provided in the system
for in-circuit emulation mode. At the user's discretion, the emulation interface and the target
system need not be coupled to the system at the outset.

The software model 315 includes the kernel 316, which controls the overall system, and
four address spaces for the software/hardware boundary - REG, S2H, H2S, and CLK. The
SEmulation system maps the hardware model into four address spaces in main memory according
to different component types and control functions: REG space 317 is designated for the register
components; CLK space 320 is designated for the software clocks; S2H space 318 is designated
for the output of the software test-bench components to the hardware model; and H2S space 319
is designated for the output of the hardware model to the software test-bench components. These
dedicated I/O buffer spaces are mapped to the kernel's main memory space during system
initialization time.

The hardware model includes several banks 326a-326d of FPGA chips and FPGA I/O
controller 327. Each bank (e.g., 326b) contains at least one FPGA chip. In one embodiment,
each bank contains 4 FPGA chips. In a 4x4 array of FPGA chips, banks 326b and 326d may be
the low bank and banks 326a and 326c may be the high bank. The mapping, placement, and
routing of specific hardware-modeled user circuit design elements to specific chips and their
interconnections are discussed with respect to FIG. 6. The interconnection 328 between the
software model 315 and the hardware model 325 is a PCI bus system. The hardware model also
includes the FPGA I/O controller 327 which includes a PCI interface 380 and a control unit 381
for controlling the data traffic between the PCI bus and the banks 326a-326d of FPGA chips
while maintaining the throughput of the PCI bus. Each FPGA chip further includes several
address pointers, where each address pointer corresponds to each address space (i.e., REG, S2H,
H2S, and CLK) in the software/hardware boundary, to couple data between each of these address
spaces and each FPGA chip in the banks 326a-326d of FPGA chips.

Communication between the software model 315 and the hardware model 325 occurs
through a DMA engine or address pointer in the hardware model. Alternatively, communication
also occurs through both the DMA engine and the address pointer in the hardware model. The
kernel initiates DMA transfers together with evaluation requests through direct mapped I/O
control registers. REG space 317, CLK space 320, S2H space 318, and H2S space 319 use I/O
datapath lines 321, 322, 323, and 324, respectively, for data delivery between the software model
315 and the hardware model 325.

Double buffering is required for all primary inputs to the S2H and CLK spaces because 
these spaces take several clock cycles to complete the updating process. Double buffering avoids 
disturbing the internal hardware model states, which may cause race conditions.
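
The specification does not detail how the double buffer is built, so the following Python sketch only illustrates the general idea under assumed names (DoubleBufferedSpace, write, commit, read_active): a multi-cycle update is accumulated in a staging copy, and the hardware model sees the new values only once the whole update is complete.

```python
class DoubleBufferedSpace:
    """Hypothetical model of a double-buffered primary-input space (e.g., S2H or CLK)."""

    def __init__(self, size):
        self._staging = [0] * size  # filled word by word over several cycles
        self._active = [0] * size   # the copy visible to the hardware model

    def write(self, index, value):
        # Partial updates land in the staging copy only.
        self._staging[index] = value

    def commit(self):
        # The completed update becomes visible in a single step.
        self._active = list(self._staging)

    def read_active(self):
        return tuple(self._active)
```

Because the active copy changes only on commit, the modeled logic never evaluates a half-written set of inputs, which is the disturbance the double buffering is meant to avoid.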

The S2H and CLK spaces are the primary inputs from the kernel to the hardware model. As
described above, the hardware model holds substantially all the register components and the
combinational components of the user's circuit design. Furthermore, the software clock is
modeled in software and provided in the CLK I/O address space to interface with the hardware
model. The kernel advances simulation time, looks for active test-bench components, and
evaluates clock components. When any clock edge is detected by the kernel, registers and
memories are updated and values through combinational components are propagated. Thus, any
changes in values in these spaces will trigger the hardware model to change logic states if the
hardware acceleration mode is selected.

For in-circuit emulation mode, emulation interface 382 is coupled to the PCI bus 328 so
that it can communicate with the hardware model 325 and the software model 315. The kernel
316 controls not only the software model, but also the hardware model during the hardware
accelerated simulation mode and the in-circuit emulation mode. The emulation interface 382 is
also coupled to the target system 387 via cable 390. The emulation interface 382 also includes
the interface port 385, emulation I/O control 386, the target-to-hardware I/O buffer (T2H) 384,
and the hardware-to-target I/O buffer (H2T) 383.

The target system 387 includes a connector 389, a signal-in/signal-out interface socket 
388, and other modules or chips that are part of the target system 387. For example, the target 
system 387 could be an EGA video controller, and the user's circuit design may be one particular
I/O controller circuit. The user's circuit design of the I/O controller for the EGA video
controller is completely modeled in software model 315 and partially modeled in hardware model 
325. 

The kernel 316 in the software model 315 also controls the in-circuit emulation mode. The 
control of the emulation clock is still in the software via the software clock, the gated clock logic,
and the gated data logic so no set-up and hold-time problems will arise during in-circuit emulation 
mode. Thus, the user can start, stop, single-step, assert values, and inspect values at any time 
during the in-circuit emulation process. 

To make this work, all clock nodes between the target system and the hardware model are 
identified. Clock generators in the target system are disabled, clock ports from the target system
are disconnected, or clock signals from the target system are otherwise prevented from reaching
the hardware model. Instead, the clock signal originates from a test-bench process or other form
of software-generated clock so that the software kernel can detect active clock edges to trigger the
data evaluation. Hence, in ICE mode, the SEmulation system uses the software clock to control
the hardware model instead of the target system's clock.

To simulate the operation of the user's circuit design within the target system's
environment, the primary input (signal-in) and output (signal-out) signals between the target
system 40 and the modeled circuit design are provided to the hardware model 325 for evaluation.
This is accomplished through two buffers, the target-to-hardware buffer (T2H) 384 and the
hardware-to-target buffer (H2T) 383. The target system 387 uses the T2H buffer 384 to apply
input signals to the hardware model 325. The hardware model 325 uses the H2T buffer 383 to
deliver output signals to the target system 387. In this in-circuit emulation mode, the hardware
model sends and receives I/O signals through the T2H and H2T buffers instead of the S2H and H2S
buffers because the system is now using the target system 387, instead of test-bench processes in
the software model 315, to evaluate the data. Because the target system runs at a speed
substantially higher than the speed of the software simulation, the in-circuit emulation mode will
also run at a higher speed. The transmission of these input and output signals occurs on the PCI
bus 328.

Furthermore, a bus 61 is provided between the emulation interface 382 and the hardware
model 325. This bus is analogous to the bus 61 in FIG. 1. This bus 61 allows the emulation
interface 382 and the hardware model 325 to communicate via the T2H buffer 384 and the H2T
buffer 383.

Typically, the target system 387 is not coupled to the PCI bus. However, such a coupling 
may be feasible if the emulation interface 382 is incorporated in the design of the target system
387. In this set-up, the cable 390 will not be present. Signals between the target system 387 and 
the hardware model 325 will still pass through the emulation interface. 

V. POST-SIMULATION ANALYSIS MODE

The SEmulation system of the present invention can support value change dump (VCD), a
widely used simulator function for post-simulation analysis. Essentially, the VCD provides a
historical record of all inputs and selected register outputs of the hardware model so that later,
during post-simulation analysis, the user can review the various inputs and resulting outputs of
the simulation process. To support VCD, the system logs all inputs to the hardware model. For
outputs, the system logs all values of hardware register components at a user-defined logging
frequency (e.g., 1/10,000 record/cycle). The logging frequency determines how often the output
values are recorded. For a logging frequency of 1/10,000 record/cycle, output values are
recorded once every 10,000 cycles. The higher the logging frequency, the more information is
recorded for later post-simulation analysis. The lower the logging frequency, the less
information is stored for later post-simulation analysis. Because the selected logging frequency
has a causal relationship to the SEmulation speed, the user should select the logging frequency
with care. A higher logging frequency will decrease the SEmulation speed because the system
must spend time and resources to record the output data by performing I/O operations to memory
before further simulation can be performed.

With respect to the post-simulation analysis, the user selects a particular point at which
simulation is desired. If the logging frequency is 1/500 records/cycle, register values are
recorded for points 0, 500, 1000, 1500, and so on every 500 cycles. If the user wants results at
point 610, for example, the user selects point 500, which is recorded, and simulates forward in
time until the simulation reaches point 610. During the analysis stage, the analysis speed is the
same as the simulation speed because the user initially accesses data for point 500 and then
simulates forward to point 610. Note that at higher logging frequencies, more data is stored for
post-simulation analysis. Thus, for a logging frequency of 1/300 records/cycle, data is stored for
points 0, 300, 600, 900, and so on every 300 cycles. To obtain results at point 610, the user
initially selects point 600, which is recorded, and simulates forward to point 610. Notice that the
system can reach the desired point 610 faster during post-simulation analysis when the logging
frequency is 1/300 than 1/500. However, this is not always the case. The particular analysis
point in conjunction with the logging frequency determines how fast the post-simulation analysis
point is reached. For example, the system can reach point 523 faster if the VCD logging
frequency was 1/500 rather than 1/300.
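
The arithmetic behind these examples can be checked with a short Python sketch. The helper name replay_cost and its parameters are hypothetical; the sketch only assumes, as stated above, that register values are recorded at every multiple of the logging interval starting at point 0.

```python
def replay_cost(target, interval):
    """Return (nearest recorded point at or before target, cycles to simulate forward)."""
    start = (target // interval) * interval
    return start, target - start


# Point 610: 110 cycles forward from 500, but only 10 cycles forward from 600.
print(replay_cost(610, 500), replay_cost(610, 300))
# Point 523: 23 cycles forward from 500, but 223 cycles forward from 300.
print(replay_cost(523, 500), replay_cost(523, 300))
```

The second pair shows why a higher logging frequency does not always reach a given analysis point sooner.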

The user can then perform analysis after SEmulation by running the software simulation
with input logs to the hardware model to compute the value change dump of all hardware
components. The user can also select any register log point in time and start the value change
dump from that log point forward in time. This value change dump method can link to any
simulation waveform viewer for post-simulation analysis.

VCD On-Demand System

One embodiment of the present invention is a system that generates VCD on demand
without simulation rerun. In accordance with one embodiment of the present invention, the VCD
on-demand technology as described herein incorporates the following high level attributes: (1)
RCC-based parallel simulation history compression and recording, (2) RCC-based parallel
simulation history decompression and VCD file generation, and (3) on-demand software
regeneration for a selected simulation target range and design review without simulation rerun. Each
of these attributes will be discussed in greater detail below.

During a debug session, the EDA tool (hereinafter referred to as the RCC System, which
incorporates the various aspects of the present invention) records the primary inputs from a test
bench process so that any portion of the simulation can be reproduced. The user can then
selectively command the EDA tool, or RCC System, to dump the hardware state information from
any simulation time range into a VCD file for later analysis. Thereafter, the user can immediately
begin debugging his design in the selected simulation time range. If the selected simulation time
range does not include the bug that the user is seeking to fix, he can select another simulation time
range for dump into the VCD file. The user can then analyze this new VCD file. With this VCD
on-demand feature, the user can cease simulation at any point and request the generation of another
selective VCD file on-demand from any desired simulation time starting point to any simulation
time end point.

In a typical debug session, the user debugs his design using the RCC System illustrated in
FIG. 83. During the first simulation run, the user fast simulates his design from a desired beginning
simulation time to any desired end simulation time, referred to herein as a simulation session range.
During this fast simulation run, a highly compressed form of the primary inputs is recorded in an
"input history" file so that any portion of the simulation session can be reproduced. At the end of
the simulation session range, the RCC System saves the hardware state information from this end
point in a "simulation history" file so that the user can return to debugging the design past this end
point if desired.

At the end of the fast simulation run, the user will analyze the results and invariably detect
some problem with his design. The user then makes a guess that the source of the problem (i.e.,
bug) is located in a particular narrow simulation time range, referred to herein as the simulation
target range, which is within the broader simulation session range. For example, if the simulation
session range encompassed 1,000 simulation time steps, the narrower simulation target range might
include only 100 simulation time steps at a particular location within the broader simulation session
range.

Once the user makes a guess as to the precise location of the simulation target range to isolate
the bug, the RCC System fast simulates from the beginning by decompressing the compressed
primary inputs in the input history file and delivering the decompressed primary inputs into the
hardware model for evaluation. When the RCC System reaches the simulation target range, it
dumps the evaluated results (e.g., hardware node values and register states) into a VCD file.
Thereafter, the user can analyze this region more carefully by replaying his design using the VCD
file starting from the beginning of the simulation target range, rather than having to rerun the
simulation from the beginning of the simulation session range, or even from the very beginning of
the simulation. This feature of saving the hardware states from the simulation target range as a VCD
file saves the user an enormous amount of debug time, time that would otherwise be wasted on
simulation rerun.
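
The replay-and-dump flow described above can be sketched in Python as follows. This is an illustration only: the function replay_and_dump, the step callback that stands in for the hardware model, and the list of (time, state) pairs that stands in for a VCD file are all assumptions, not the RCC System's interfaces.

```python
def replay_and_dump(initial_state, inputs, step, t_start, target_begin, target_end):
    """Replay recorded inputs from t_start; keep states only inside the target range.

    `step(state, value)` is a hypothetical stand-in for one hardware evaluation.
    Returns a list of (simulation time, state) pairs for the target range.
    """
    state = initial_state
    dump = []
    for offset, value in enumerate(inputs):
        t = t_start + offset
        if t > target_end:
            break  # nothing past the target range needs to be evaluated
        state = step(state, value)  # every earlier input must still be evaluated
        if t >= target_begin:
            dump.append((t, state))  # only the target range is dumped
    return dump
```

Note that the inputs before the target range are evaluated but not dumped, which keeps the dump small while still producing correct states inside the range.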

Referring now to FIG. 83, a high level view of the RCC System that incorporates one
embodiment of the present invention is illustrated. The RCC System includes an RCC Computing
System 2600 and an RCC Hardware Accelerator 2620. As described elsewhere in this patent
specification, the RCC Computing System 2600 contains the computational resources that are
necessary to allow the user to simulate the user's entire software-modeled design in software and
control the hardware acceleration of the hardware-modeled portion of the design. To this end, the
RCC Computing System 2600 contains the CPU 2601, various clocks 2602 (including the software
clock that is described elsewhere in this patent specification) that are needed by the various
components of the RCC System, test bench processes 2603, and system disk 2604. In contrast to
some conventional hardware-based event history buffers, the system disk, rather than a small
hardware RAM buffer, is used to record the compressed data. Although not shown, the RCC
Computing System 2600 includes other logic components and bus subsystems that provide the
circuit designer with the computational power to run diagnostics and various software and to manage
files, among other tasks that a computing system performs.

The RCC Hardware Accelerator 2620, which is also referred to as the RCC Array in other
sections of this patent specification, contains the reconfigurable array of logic elements (e.g.,
FPGA) that can model at least a portion of the user's design in hardware so that the user can
accelerate the debugging process. To this end, the RCC Hardware Accelerator 2620 includes the
array of reconfigurable logic elements 2621 which provides the hardware model of a portion of the
user design. The RCC Computing System 2600 is tightly coupled to the RCC Hardware
Accelerator 2620 via the software clock as described elsewhere in this patent specification and a bus
system, a portion of which is shown as lines 2610 and 2611 in FIG. 83.

The VCD on-demand aspect of the present invention will now be discussed with respect to
FIG. 84. FIG. 84 shows a timeline of several simulation times - t0, t1, t2, and t3. The simulation
session range is between simulation time t0 and simulation time t3, which of course includes
simulation times t1 and t2. Simulation time t0 represents the first simulation time in the simulation
session range where fast simulation begins. This simulation time t0 represents the first simulation
time for any separable simulation session, or simulation session range. In other words, assume that
today's debug session includes an examination of the simulation session range from t=10,000 to
t=12,000. The user guesses that the particular bug is located somewhere between t=10,500 and
t=10,750. For this simulation session range, the simulation time t0 is t=10,000. Assume that the
particular bug is located and fixed for this simulation session range t=10,000 to t=12,000.
Tomorrow, the user then moves on to the next simulation session range t=12,000 to t=15,000. Here,
the simulation time t0 is t=12,000. In some cases, simulation time t0 represents the very first
simulation time for the user design's first debug session; that is, t0 corresponds to t=0.

Analogously, simulation time t3 represents the last simulation time for the selected
simulation session range. In other words, assume that today's debug session includes an
examination of the simulation session range from t=14,555 to t=16,750. For this simulation session
range, the simulation time t3 is t=16,750. Assume that the particular bug is located and fixed for
this simulation session range t=14,555 to t=16,750. The user then moves on to the next simulation
session range t=16,750 to t=19,100. Here, the simulation time t3 is t=19,100. In some cases,
simulation time t3 represents the very last simulation time for the user design's last debug session.

The user may continue to simulate beyond this simulation time t3 if desired but for the
moment, he is focused on debugging his design for the simulation times t0 to t3, the current
simulation session range. Typically, when the bugs have been ironed out for the current simulation
session range, the user will then proceed to simulate his design beyond simulation time t3 into the
next simulation session range.

In this abstract representation of the simulation session range, these simulation time periods
t0-t3 are not necessarily contiguous to each other; that is, simulation times t0 and t1 are not
immediately adjacent to each other. Indeed, simulation times t0 and t1 may be thousands of
simulation time periods apart.

Because one embodiment of the present invention will be implemented in the RCC System,
references to various components of the RCC System shown in FIG. 83 will be made. First, the
RCC System's input and simulation history generation operation will be discussed. This generation
operation includes some form of data compression for the primary inputs and recordation of the
compressed primary inputs. Second, the RCC System's VCD generation operation will be
discussed. This VCD generation operation includes decompressing the primary inputs to reproduce
the simulation history and dumping the hardware states into a VCD file for the simulation target
range. Third, the VCD file review process is then discussed. Although the term "simulation
history" is used at times, this does not mean that the entire debug session involves software
simulation. Indeed, the RCC System generates VCD files from hardware states and the software
model is used only for later analysis of the VCD file.

Input and Simulation History Generation - Compress and Record

At the outset, the user models the design in software in the RCC Computing System 2600 of
FIG. 83. For some portion of the design, the RCC Computing System 2600 automatically generates
a hardware model of the design based on the hardware description language (e.g., VHDL). The
hardware model is configured in the array of reconfigurable logic elements 2621, which is a portion
of the RCC Hardware Accelerator 2620. With this setup, the user can simulate the design in
software in the RCC Computing System 2600, accelerate a portion (i.e., simulation time step or
distinct physical section of the circuit) of the design using the RCC Hardware Accelerator 2620, or
use a combination of simulation and hardware acceleration.

The user has just completed his latest circuit design. It is now time to debug the design to
look for flaws. If the user had previously debugged an earlier version of the design, he has some
idea of where a bug might be located. On the other hand, if this is the very first debug session for
this new design, the user must make some guess as to the location of a potential bug. In either case,
some guess work is needed to generally locate the bug. For the purposes of this discussion, assume
the user is debugging the design for the very first time.

In debugging the design, the user selects a simulation session range. Theoretically, this
simulation session range can be any length of simulation times. In practice, however, the simulation
session range should be selected to be short enough to isolate a few bugs in the design and long
enough to quickly move the debugging process and minimize the number of debug sessions
necessary to fully debug a design. Obviously, a simulation session range of two or three simulation
time steps will not reveal the existence of any bug. Furthermore, this small simulation session range
will force the user to conduct many repetitive tasks that will slow the debug process. If the selected
simulation session range is a million simulation time steps, too many bugs may manifest themselves
and thus, the user will find it difficult to implement a more focused attack on the problem.

Once the user has selected a simulation session range, he commands the RCC System to fast
simulate from simulation time t0 to simulation time t3, as shown in FIG. 84. As explained above,
the separation of the simulation times t0 to t3 may be any selected range, but simulation time t0
represents the beginning of the simulation and simulation time t3 represents the last simulation time
for this simulation session range.

At simulation time t0, fast simulation begins in the RCC Computing System 2600. Fast
simulation is performed from simulation time t0 to simulation time t3 instead of normal simulation
mode because no regeneration of the software model is needed during this time period. As described
elsewhere in this patent specification, the regeneration operation requires the RCC Computing
System 2600 to receive hardware state information (e.g., node values, register states) so that more
sophisticated logic elements (e.g., combinational logic) can be regenerated in software for further
analysis by the user. Of course, some users may want to view the software model during the
simulation process, in which case, the RCC Computing System 2600 does not perform fast
simulation. In this case, the simulation process is much slower due to the extra time needed by the
RCC Computing System 2600 to regenerate the software model from the primary outputs of the
hardware model.

Initially, the full states of the design, such as the software model states and hardware model
register and node values, are saved at simulation time t0 into a file, called the "simulation history" file,
in the system disk. This allows the user to load the states of the design into the RCC System at any
time in the future for debugging purposes. During this fast simulation period for the simulation
session range from simulation time t0 to simulation time t3, the RCC Computing System 2600
applies two distinct processes to the primary inputs Ip in parallel. The raw primary inputs from the
test bench processes 2603 are provided on line 2610 to the RCC Hardware Accelerator 2620 for
evaluation. Concurrently, the same primary inputs from the test bench processes are compressed
and recorded in system disk as a separate file, called an "input history" file, so that the entire history
of the primary inputs can be collected to allow the user to reproduce any part of the simulation later.
In particular, the primary inputs corresponding to simulation time t0 to simulation time t3 are
compressed and saved in system disk.
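
A minimal Python sketch of these two processes follows. The specification does not name a compression scheme or a file layout, so the use of zlib, the JSON-lines encoding, and the names run_session, load_inputs, and step are assumptions for illustration; the two processes are also interleaved in one loop here rather than run truly in parallel.

```python
import json
import zlib


def run_session(inputs, step, state):
    """Evaluate raw inputs while recording a compressed copy of the same inputs.

    `step(state, value)` is a hypothetical stand-in for the hardware model.
    Returns (final state, compressed input history as bytes).
    """
    compressor = zlib.compressobj()
    chunks = []
    for value in inputs:
        state = step(state, value)  # the model sees the raw, uncompressed input
        record = json.dumps(value).encode() + b"\n"
        chunks.append(compressor.compress(record))  # the history gets a compressed copy
    chunks.append(compressor.flush())
    return state, b"".join(chunks)


def load_inputs(history):
    """Decompress an input history back into the original list of inputs."""
    lines = zlib.decompress(history).decode().splitlines()
    return [json.loads(line) for line in lines]
```

The returned bytes could be written to a file on the system disk; decompressing them yields the exact input sequence, which is what allows any part of the session to be reproduced later.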

When the RCC Hardware Accelerator 2620 receives the primary inputs Ip from the test
bench processes 2603, it processes the primary inputs. As a result, hardware states in the hardware
model will most likely change as the various logic and other circuit devices evaluate the data.
During this period from simulation time t0 to simulation time t3, the RCC System need not wait for
the RCC Computing System 2600 to perform its logic regeneration since the user is not interested in
finely debugging the design during this fast simulation period. The RCC System also does not save
the primary outputs (e.g., hardware node values and register states) yet. Note that while the RCC
Computing System 2600 compresses the primary inputs for recording into the "input history" file,
the RCC Hardware Accelerator 2620 evaluates the raw and uncompressed primary inputs. In other
embodiments, the RCC System does not compress the primary inputs for recording into the input
history file.

Why does the RCC Computing System 2600 deliver the primary inputs to the RCC
Hardware Accelerator for evaluation when these outputs will not be saved at all during the fast
simulation period? The RCC System needs to save the hardware states of the design based on its
evaluation of the primary inputs from the beginning of the simulation to simulation time t3. An
accurate snapshot of the hardware model states cannot be obtained at simulation time t3 unless the
hardware model has evaluated the entire history of primary inputs from the beginning to this point
t3, not the inputs from just simulation time t3. Logic circuits have memory attributes that will affect
the results of the evaluation based on the order of the inputs. Thus, if the primary inputs from just
simulation time t3 (or the simulation time immediately prior to simulation time t3) are fed to the
hardware model for evaluation, the hardware model will probably exhibit the wrong states at this
simulation time t3.
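
The dependence on the full input history can be illustrated with a toy sequential circuit. The following is a minimal sketch only; the 4-bit counter and the names `run_counter`, `history`, `full_state`, and `last_only` are hypothetical and are not part of the RCC System.

```python
def run_counter(inputs, state=0):
    """Toy sequential circuit: a 4-bit counter that increments on each input bit of 1."""
    for bit in inputs:
        if bit:
            state = (state + 1) % 16
    return state


# Hypothetical primary inputs from the start of the simulation up to time t3.
history = [1, 0, 1, 1, 0, 1]

# State after evaluating the entire input history.
full_state = run_counter(history)

# State if only the final input (the one at t3) is applied to a fresh model.
last_only = run_counter(history[-1:])
```

Because the counter has memory, `full_state` and `last_only` differ, which is why the hardware model must evaluate every primary input before its state at t3 is saved.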

Why are the hardware model states saved for simulation time t3? A large design with over a
million gates and over a million simulation time steps cannot be debugged in a relatively short
period of time. The user needs multiple simulation sessions to debug this design. To quickly move
from one simulation session to the next, the RCC System saves the hardware states (along with the
compressed primary inputs) from simulation time t3 so that the user can debug the next simulation
session range which begins at simulation time t3. With the saved hardware model states, the user
need not simulate from the very beginning of the simulation; rather, the user can quickly and
conveniently return to simulation time t3 after debugging the design from simulation time t0 to
simulation time t3. The hardware model states at simulation time t3, saved in the simulation history
file, represent the correct snapshot of his design that is a reflection of the entire history of primary
inputs up to that point.

The hardware model in the RCC Hardware Accelerator 2620 provides internal hardware
states on line 2611 to the RCC Computing System 2600, so that the RCC Computing System 2600
can build or regenerate the various logic elements (e.g., combinational logic) in the software model,
if necessary and desired by the user. But, as noted above, the user is not concerned with observing
the software simulation during the fast simulation of the simulation session range. Accordingly,
these internal hardware states from the RCC Hardware Accelerator are not saved in the system disk,
since the internal hardware states will not be examined by the user for bugs for now.

At simulation time t3, or at the end of the simulation session range, this particular fast
simulation operation ceases. The evaluation results or primary outputs (e.g., register values) from
the design's hardware model in the RCC Hardware Accelerator 2620 corresponding to simulation
time t3 are saved in the simulation history file. This is done so that when the user has debugged the
design from simulation time t0 to simulation time t3, he can then proceed straight to simulation time
t3 for further debugging as necessary. The user need not rerun the simulation from simulation time
t0 to debug his design at some point beyond simulation time t3.

In sum, from simulation time t0 to simulation time t3 (i.e., simulation session range), the
user is essentially accelerating the design by feeding the RCC Hardware Accelerator 2620 with the
primary inputs from the test bench process 2603 on line 2610 while at the same time compressing
the same primary inputs and saving them into system disk for future reference. The RCC
Computing System 2600 needs to save the primary inputs (compressed or otherwise) in the input
history file to reproduce the debug session. The compression operation also occurs in parallel with
the data evaluation in the RCC Hardware Accelerator 2620. Finally, at simulation time t3 at the end
of the simulation session range, the RCC System saves the state information of the hardware model
into a simulation history file.

In one embodiment of the present invention, all recorded compressed primary inputs from
the simulation session range are part of the same file that will be modified later for the hardware
state information from simulation time t3. In another embodiment, the saved information from the
simulation session range and the hardware state information from simulation time t3 are each saved
as distinct files in system disk. Similarly, any of the above described files may be modified with the
VCD on-demand information that is created later for the simulation target range. Alternatively, the
VCD on-demand information may be saved in a distinct VCD file in system disk that is separate
from the compressed primary input file and the simulation time t3 hardware state information file.
In other words, in accordance with one embodiment of the present invention, the input history file,
the simulation history file, and the VCD file may be incorporated together in one file. In another
embodiment, the input history file, the simulation history file, and the VCD file may be separate
files. Also, the input history file and the simulation history file may be incorporated in one file that
is separate from the VCD file.

The compression scheme will now be discussed. In accordance with one embodiment of the
present invention, the RCC System's compression logic allows for a compression ratio of 20X for
the primary input events with 10% input events per simulation time step. Thus, a large ASIC design
having over a million gates may require 200 primary input events. For 10% input events per
simulation time step, approximately 20 inputs need to be compressed and recorded. If each input
signal is 2 bytes long, 20 input signals result in 40 bytes of data that need to be processed at the primary
inputs per simulation time step. For a compression ratio of 20X, the 40 bytes of data can be
compressed to 2 bytes of data per simulation time step. Thus, for a design that requires about 1
million simulation time steps, the RCC System compresses the primary inputs to 2 megabytes of
data. A file of this size can be easily managed by any computing file system and the waveform
viewer. In one embodiment, ZIP compression is used.
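
The arithmetic above can be checked with a short calculation. The sketch below uses only the figures quoted in this paragraph; the variable names are illustrative and are not part of the RCC System.

```python
# Figures quoted in the text above.
primary_inputs = 200          # primary input events for the example design
activity_percent = 10         # percentage of inputs with events per time step
bytes_per_input = 2           # size of one input signal
compression_ratio = 20        # 20X compression
time_steps = 1_000_000        # length of the simulation run

# About 20 inputs change per simulation time step.
active_inputs = primary_inputs * activity_percent // 100

# 20 inputs * 2 bytes = 40 bytes of raw data per time step.
raw_bytes_per_step = active_inputs * bytes_per_input

# 40 bytes / 20 = 2 bytes of compressed data per time step.
compressed_bytes_per_step = raw_bytes_per_step // compression_ratio

# 2 bytes * 1 million steps = 2 megabytes for the whole run.
total_megabytes = compressed_bytes_per_step * time_steps / 1_000_000
```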

In accordance with one embodiment, the primary input compression is performed in parallel
with the primary input evaluation by the RCC Hardware Accelerator 2620; input history file
generation occurs concurrently with the primary input evaluation. Accordingly, the compression
scheme provides no direct negative impact on the RCC System's performance. The only possible
bottleneck is the process of recording the compressed primary inputs into the system disk.
However, since the data is highly compressed, the RCC System experiences less than 5% slowdown
for most designs running at 50,000 simulation time steps per second.

As for the specific manner in which recording is controlled in the RCC System, the user
must first use the $rcc(record) command to initialize the RCC recording feature in accordance with
one embodiment of the present invention:

$rcc(record, name, <disk space>, <checkpoint control>);

An explanation of the arguments name, <disk space>, and <checkpoint control> will now be
discussed. The "name" argument is the record name for the current simulation session range.
Different names are required to distinguish different simulation runs of the same design. A distinct
record name is needed especially for off-line VCD on-demand debugging.

The <disk space> argument is an optional parameter to specify the maximum disk space (in
units of MB) allocated for the RCC System recording process. The default value is 100 MB. The
RCC System only records the latest part of the current simulation session range within the specified
disk space. In other words, if the <disk space> value is specified as 100 MB but the current
simulation session range takes up 140 MB, the RCC System records only the last 100 MB while
discarding the first 40 MB of compressed primary inputs. This aspect of the invention provides one
benefit for failure analysis. In one embodiment of the present invention, the test bench process has
some self-testing functions to detect simulation failures and stop the simulation. The latest history
of the RCC simulation can provide most of the information for such failure analysis.
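
The "keep only the latest part" behavior can be sketched as a bounded buffer that discards its oldest records once the limit is exceeded. This is a minimal illustration under stated assumptions: the class `BoundedRecorder` and its byte-level bookkeeping are hypothetical, and the sketch ignores how the actual recording file or its checkpoints are laid out on disk.

```python
from collections import deque


class BoundedRecorder:
    """Keeps only the most recent records that fit within max_bytes (hypothetical)."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.records = deque()
        self.used = 0

    def append(self, record):
        # Store the newest record, then drop the oldest ones until we fit again.
        self.records.append(record)
        self.used += len(record)
        while self.used > self.max_bytes:
            oldest = self.records.popleft()
            self.used -= len(oldest)


# Scaled-down version of the 100 MB / 140 MB example: a 100-byte limit
# receiving fourteen 10-byte records (140 bytes in total).
recorder = BoundedRecorder(max_bytes=100)
for i in range(14):
    recorder.append(bytes([i]) * 10)
```

After the loop, the recorder holds the last ten records (100 bytes) and the first four (40 bytes) have been discarded, mirroring the example in the text.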

The <checkpoint control> argument is an optional parameter that specifies the number of
simulation time steps needed to perform a full-state checkpoint. The default is 1,000,000 time
steps. As in most conventional compression algorithms, the compressed primary inputs are
based on the state difference between successive simulation time steps. For long simulation runs,
checkpoints for the full RCC states at a given low frequency can greatly facilitate simulation history
extraction. For a decompression rate of 20K to 200K simulation time steps per second in the RCC
System and checkpoints located once every one million steps, the RCC System can extract (i.e.,
reproduction of the simulation from the primary inputs and selected VCD file generation) any
simulation history within 5 to 50 seconds.
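
The 5 to 50 second bound follows from the checkpoint interval divided by the decompression rate. The sketch below assumes that extraction restarts from the nearest full-state checkpoint at or before the requested time step; the function names are hypothetical.

```python
def nearest_checkpoint(step, interval=1_000_000):
    """Latest full-state checkpoint at or before the requested time step."""
    return (step // interval) * interval


def replay_seconds(step, rate, interval=1_000_000):
    """Time to decompress from the nearest checkpoint up to the requested step."""
    return (step - nearest_checkpoint(step, interval)) / rate


# Worst case: almost a full checkpoint interval must be replayed.
worst_case_slow = 1_000_000 / 20_000    # 50 seconds at 20K steps per second
worst_case_fast = 1_000_000 / 200_000   # 5 seconds at 200K steps per second
```

For example, a request for time step 3,400,000 would restart from the checkpoint at 3,000,000 and replay 400,000 steps.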

When this $rcc(record) command is invoked, the RCC System will record the simulation 
history; that is, the primary inputs will be compressed and recorded in a file for storage in the 
system disk. The primary outputs from the RCC Hardware Accelerator are ignored since software 
logic regeneration is not needed at this time. The recording process can be terminated with either 
the commands $rcc(stop) or $rcc(off), at which point the RCC System switches control of the
simulation back to the software model. At this point, the primary outputs are processed for software 
logic regeneration. 

VCD Generation - Decompress and Dump 

As described above, the RCC System has saved the software model and hardware model at
the beginning of the simulation session range at simulation time t0, recorded the compressed
primary inputs for the entire simulation session range in the input history file, and saved the
hardware model states for the design at the end of the simulation session range at simulation time t3
in the simulation history file. The user now has enough information to load the design at the start of
the simulation session range from the design information from simulation time t0. With the
compressed primary inputs, the user can software simulate any portion of his design. However, 
with the VCD on-demand feature, the user will probably not want to software simulate his design at 
this point. Rather, the user will want to generate a VCD file for the selected simulation target range 
for fine analysis to isolate and fix the bug. Indeed, with the recorded compressed primary inputs, 
the RCC System can reproduce any point within the simulation session range. Moreover, the RCC 
System can simulate beyond the current simulation session range if desired by loading the 
previously saved hardware state information from simulation time t3. 

After fast simulating the design, the user reviews the results to determine if a bug exists. If
no bug is apparent to the user, the design may be free of bugs for the current simulation session
range. The user can then proceed to simulate beyond the current simulation session range to the
next simulation session range, whatever selected range this may be. If, however, the user has
determined that the design has some sort of problem, he must analyze the simulation more carefully
to isolate and fix the bug. Because the entire simulation session range is too large for careful and
detailed analysis, the user must target a particular narrower range for further study. Based on the
user's familiarity with the design and perhaps past debugging efforts, the user makes a reasonable
guess as to the location of the bug within the simulation session range. The user will focus on a
selected simulation target range that should correspond with the user's guess as to the location of
the bug (or where the bug will manifest itself). The user determines that the simulation target range
is between simulation time t1 and simulation time t2 as shown in FIG. 84.

The RCC System loads the software model of the design in the RCC Computing System
2600 and the hardware model in the RCC Hardware Accelerator 2620 with the previously saved
configuration information from simulation time t0. The RCC System then fast simulates from
simulation time t0 to simulation time t1. During the fast simulation operation, the RCC Computing
System loads the previously saved file containing the compressed primary inputs. The RCC
Computing System decompresses the compressed primary inputs and delivers the decompressed
primary inputs to the RCC Hardware Accelerator 2620 for evaluation. Like the initial fast
simulation operation which compressed and saved the primary inputs for the simulation session
range, the primary outputs which are the evaluated results (e.g., hardware model node values and
register states) are not saved during the fast simulation operation from simulation time t0 to
simulation time t1.

Once the fast simulation operation reaches the beginning of the simulation target range, or
simulation time t1, the RCC System then dumps the evaluated results (i.e., primary outputs Op)
from the hardware model in the RCC Hardware Accelerator 2620 into a VCD file in the system
disk. Unlike the initial fast simulation operation for the simulation session range, the RCC
Computing System 2600 does not perform any compression. Again, the RCC Computing System
2600 does not perform any regeneration operation for the software model since the user need not
view the evaluation results at this time. By not performing any regeneration operation for the
software model, the RCC System can quickly generate the VCD file.

In other embodiments, however, the user may concurrently view the software model of his
design for this simulation time period from t1 to t2 while saving the primary outputs. If so, the
RCC Computing System 2600 performs the software model regeneration operation to allow the user
to view any and all states from any aspect of his design.

At simulation time t2, the RCC Computing System 2600 ceases saving the evaluation
outputs from the RCC Hardware Accelerator 2620 in the VCD file. At this point, the user can stop
fast simulating. The RCC System now has the complete VCD file for the simulation target range
and the user can proceed to analyze the VCD file in greater detail.

When the user wants to analyze the VCD file, he need not rerun the simulation from the very
beginning (e.g., simulation time t0). Instead, the user can command the RCC System to load the
saved hardware state information from the beginning of the simulation target range and view the
simulated results with the software model. This will be described in more detail below in the
Simulation History Review section.

Upon analyzing the VCD file, the user may or may not discover the bug. If the bug is found,
the user will of course commence fixing the design. If the bug is not found, the user may have
made a wrong guess of the simulation target range that he suspects has the bug. The user must then
repeat the same process that he used above with respect to the decompress and VCD file dump.
The user makes another guess with, hopefully, a better simulation target range within the simulation
session range. Having done so, the RCC System fast simulates from the beginning of the
simulation session range to the beginning of the new simulation target range, decompressing the
primary inputs and delivering them to the RCC Hardware Accelerator 2620 for evaluation. When
the RCC System reaches the beginning of the new simulation target range, the primary outputs from
the RCC Hardware Accelerator 2620 are dumped into a VCD file. At the end of the new simulation
target range, the RCC System ceases dumping the hardware state information into the VCD file. At
this point, the user can then view the VCD file for isolating the bug.

In sum, from simulation time t0 to simulation time t1, the RCC System fast simulates the
design by decompressing the previously compressed primary inputs and delivering them to the
hardware model for evaluation. During the simulation target range from simulation time t1 to
simulation time t2, the RCC System dumps the primary outputs from the hardware model into a
VCD file. At the end of the simulation target range, the user can cease fast simulating the design.
At this point, the user can then view the VCD file by going directly to simulation time t1 without
rerunning the simulation from the very beginning at simulation time t0.
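
The decompress-and-dump flow summarized above can be sketched in a few lines. This is an illustration under stated assumptions, not the RCC System's implementation: `zlib` stands in for the compression scheme, a JSON list stands in for the input history file format, and `record_inputs`, `replay_and_dump`, and the `evaluate` callback (a stand-in for the hardware model) are hypothetical names.

```python
import json
import zlib


def record_inputs(inputs):
    """Compress the primary inputs of a session range (stand-in for the input history file)."""
    return zlib.compress(json.dumps(inputs).encode())


def replay_and_dump(compressed, evaluate, initial_state, t1, t2):
    """Fast simulate from step 0, saving outputs only inside the target range t1..t2."""
    inputs = json.loads(zlib.decompress(compressed))
    state = initial_state
    dump = []
    for t, value in enumerate(inputs):
        state = evaluate(state, value)   # every input is evaluated, even before t1
        if t1 <= t <= t2:
            dump.append((t, state))      # outputs are saved only in the target range
        if t == t2:
            break                        # no need to continue toward t3
    return dump


# Toy hardware model: an accumulator whose state depends on all earlier inputs.
history = record_inputs([1, 2, 3, 4, 5, 6])
vcd = replay_and_dump(history, lambda s, v: s + v, initial_state=0, t1=2, t2=4)
```

The dumped values at steps 2 through 4 reflect the inputs at steps 0 and 1 as well, even though nothing was saved for those earlier steps.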

When the review of this simulation target range is completed and the bug has been isolated
and removed, the user can then proceed to the next simulation session range. This new simulation
session range begins at simulation time t3. The particular length of the new simulation target range,
which can be the same length as the previous simulation session range, is selected by the user. The
RCC System loads the previously saved hardware state information corresponding to simulation
time t3. The RCC System is now ready for fast simulation of this new simulation session range.
Note that this new simulation session range corresponds to the range from simulation time t0 to t3,
where the loaded hardware state now corresponds to simulation time t0. The fast simulation, VCD
on-demand dump, and VCD review process is similar to that described above.

In accordance with one embodiment of the present invention, the decompression step does
not negatively impact performance. The RCC System can decompress the simulation history (i.e.,
compressed and recorded primary inputs) at a rate of 20,000 to 200,000 simulation time steps per
second. With proper checkpoint control, the RCC System can extract (i.e., reproduction of the
simulation from the primary inputs and selected VCD file generation) the simulation history within
50 seconds.

As for the specific manner in which the VCD on-demand feature is controlled in the RCC
System, the user must use the $axis_rpd command. The $axis_rpd is an interactive command to
extract the RCC evaluation record and create a VCD file on demand. Unlike conventional
simulation rewind technologies, the execution of the $axis_rpd command neither rewinds the
internal simulation state nor corrupts the external PLI and file I/O states. The user can continue
simulation after invoking the $axis_rpd command in the same manner as the user is capable of
simulating after the $stop command.

When no arguments are specified, the $axis_rpd command displays all available simulation
time periods within the simulation session range; that is, the user can select the simulation target
range. The time unit is the same time unit in the command line interface. An example of a
simulation log is as follows:

C1 > $rcc(record, r1);
C2 > #1000 $rcc(xt0, run);
C3 > #50000 $rcc(off);
C4 > #50500 $rcc(run);
C5 > #60000 $rcc(stop);
-- Start RCC engine at 100500.
-- Back to SIM: stop RCC engine at 5000000.
-- Start RCC engine at 5050500.
-- Back to SIM: stop RCC engine at 6000000.
Interrupt at simulation time 60000.0000ns

C6 > $axis_rpd;
available simulation history:
1005.000000 to 50000.000000
50505.000000 to 60000.000000
Interrupt at simulation time 60000.0000ns

From this simulation log, the user used the RCC engine from the time right after 1000 to
50000 and from the time right after 50500 to 60000. Thus, $axis_rpd shows the recorded simulation
windows.

To generate a VCD file from the simulation history, the user uses the $axis_rpd command
with the following control arguments:

$axis_rpd(start-time, end-time, "dump-file-name", <level and scope control>);

The start-time and end-time specify the simulation time window, or the simulation target
range, for the VCD file. The unit of the time control arguments is the time unit used in the
command line interface. The "dump-file-name" is the name of the VCD file. The dump <level and
scope control> parameters are identical to those of the standard $dumpvars command in IEEE Verilog.

As an example of the $axis_rpd command:

C7 > $axis_rpd(50505, 50600, "f1.dump");
-- start RCC VCD at 50505.010000 !!
-- end RCC VCD at 50600.000000 !!
Interrupt at simulation time 60000.0000ns

This $axis_rpd command creates a VCD file called "f1.dump" for the simulation target
range from simulation time 50505 to 50600. Just like $dumpvars, if no level and scope control
parameters are provided, the $axis_rpd command will dump the entire hardware states or primary
outputs.

Another example of the use of the $axis_rpd command is as follows:

C8 > $axis_rpd(40444, 50600, "f2.dump", 2, dp0);
-- start RCC VCD at 40000.000000 !!
-- skip at time 50000.000000.
-- continue at time 50505.000000 !!
-- end RCC VCD at 50600.000000 !!
Interrupt at simulation time 60000.0000ns

This $axis_rpd command creates a 2-level VCD file "f2.dump" on the scope dp0 from time
40000 to 50600. Since the simulation swaps back to software control during time 50000 to 50500,
$axis_rpd skips that window because no simulation record is available.
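
The skip behavior amounts to intersecting the requested time range with the recorded simulation windows. The following sketch illustrates that idea only; `dump_segments` is a hypothetical helper, not the logic of the $axis_rpd command, and the window values are taken from the simulation log above.

```python
def dump_segments(start, end, windows):
    """Return the parts of [start, end] that overlap a recorded window."""
    segments = []
    for w_start, w_end in windows:
        lo, hi = max(start, w_start), min(end, w_end)
        if lo <= hi:                      # the request overlaps this window
            segments.append((lo, hi))
    return segments


# Recorded windows reported by $axis_rpd in the log above.
recorded = [(1005, 50000), (50505, 60000)]

# A request spanning the gap is split into two dump segments.
spanning = dump_segments(40000, 50600, recorded)

# A request entirely inside one window is returned unchanged.
inside = dump_segments(50505, 50600, recorded)
```

The gap from 50000 to 50505, where the simulation ran under software control, does not appear in any segment.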

VCD on-demand is also available after the user terminates the simulation process. To
conduct off-line VCD on-demand, the user starts the simulation program named "vlg" with the
+rccplay option. With this option, the RCC System is instructed to extract the simulation record
instead of executing the normal initialization sequence for simulation. Once the user enters the
simulation program, the user can use the same $axis_rpd command to obtain VCD on demand. An
example of this procedure is as follows:

axis15:3-dp0_rtlc> vlg +rccplay+r1 -s
-- Start replay record ./AxisWork/r1 at time 100500
C1 > $axis_rpd;
available simulation history:
1005.000000 to 50000.000000
50505.000000 to 60000.000000
Interrupt at simulation time 100500
C2 > $axis_rpd(40000, 45000, "f2.dump");
-- start RCC VCD at 40000.000000 !!
-- end RCC VCD at 45000.000000 !!
Interrupt at simulation time 4500000
C3 >

In the above example, the simulation record "r1" is used to extract the simulation history and
produce the VCD on the entire design from time 40000 to 45000.

Simulation History Review

Once the VCD file of the simulation target range (i.e., simulation times t1 to t2) has been
generated by the RCC System, the user need not fast simulate from simulation time t2 to t3.
Instead, the RCC System allows the user to cease simulation and proceed directly to the beginning
of the simulation target range, or simulation time t1. Thus, in contrast to the prior art, the user does
not have to rerun the simulation from the very beginning (e.g., simulation time t0). The hardware
states that have been dumped into the VCD file reflect the evaluation of the entire history of
primary inputs from simulation time t0, including the primary inputs from simulation times t1 to t2.

The RCC System loads the VCD file. Thereafter, the saved primary outputs are delivered to
the RCC Computing System 2600 so that the software model, and all of its many combinational
logic circuits, can be regenerated with the correct state information. The user then views the
software model with a waveform viewer for debugging. With the VCD on hand, the user can step
through his software model very carefully step-by-step until the bug is isolated.

With this VCD on-demand feature, the user can select any simulation target range within the
simulation session range and perform software simulation to isolate the bug. If the bug cannot be
found in the selected simulation target range, the user can select a different simulation target
range on demand. Because all of the primary inputs from the test bench process are recorded for the
entire simulation session range, any portion of this simulation can be reproduced and viewed on
demand without rerunning the simulation. This feature allows the user to repeatedly focus on
multiple and different simulation target ranges until he has fixed the bug within this simulation
session range.

Furthermore, this VCD on-demand feature is supported on-line in the middle of the
simulation process as well as off-line after the simulation process has terminated. This on-line
support is possible because the hardware states at simulation time t0 can be saved in system disk and the
primary inputs can be compressed and recorded for any length of the simulation session range.
Thereafter, the user can then specify a simulation target range for a more focused analysis of the
primary outputs.

The off-line support is possible because the hardware states at simulation time t0, the entire
primary inputs for the simulation session range, and the hardware states at simulation time t1 are all
saved in the system disk. Thus, the user can return to debugging his design by loading the design
corresponding to simulation time t0 and then specifying the simulation target range. Also, the user
can proceed directly to the next simulation target range by loading the hardware states
corresponding to simulation time t3.

VI. HARDWARE IMPLEMENTATION SCHEMES
A. OVERVIEW

The SEmulation system implements an array of FPGA chips on a reconfigurable board.
Based on the hardware model, the SEmulation system partitions, maps, places, and routes each
selected portion of the user's circuit design onto the FPGA chips. Thus, for example, a 4x4
array of 16 chips may be modeling a large circuit spread out across these 16 chips. The
interconnect scheme allows each chip to access another chip within 2 "jumps" or links.

Each FPGA chip implements an address pointer for each of the I/O address spaces (i.e.,
REG, CLK, S2H, H2S). All address pointers associated with a particular
address space are chained together. So, during data transfer, word data in each chip is
sequentially selected from/to the main FPGA bus and PCI bus, one word at a time for the
selected address space in each chip, and one chip at a time, until the desired word data have been
accessed for that selected address space. This sequential selection of word data is accomplished
by a propagating word selection signal. This word selection signal travels through the address
pointer in a chip and then propagates to the address pointer in the next chip, continuing until it
reaches the last chip or the system initializes the address pointer.

The FPGA bus system in the reconfigurable board operates at twice the PCI bus
bandwidth but at half the PCI bus speed. The FPGA chips are thus separated into banks to utilize
the larger bandwidth bus. The throughput of this FPGA bus system can track the throughput of
the PCI bus system so performance is not lost by reducing the bus speed. Expansion is possible
through bigger boards which contain more FPGA chips or piggyback boards that extend the
bank length.

B. ADDRESS POINTER

FIG. 11 shows one embodiment of the address pointer of the present invention. All I/O
operations go through DMA streaming. Because the system has only one bus, the system
accesses data sequentially one word at a time. Thus, one embodiment of the address pointer uses
a shift register chain to sequentially access the selected words in these address spaces. The
address pointer 400 includes flip-flops 401-405, an AND gate 406, and a couple of control
signals, INITIALIZE 407 and MOVE 408.

Each address pointer has n outputs (W0, W1, W2, . . . , Wn-1) for selecting a word out
of n possible words in each FPGA chip corresponding to the same word in the selected address
space. Depending on the particular user circuit design being modeled, the number of words n
may vary from circuit design to circuit design and, for a given circuit design, n varies from
FPGA chip to FPGA chip. In FIG. 11, the address pointer 400 is only a 5 word (i.e., n=5)
address pointer. Thus, this particular FPGA chip which contains this 5-word address pointer for
a particular address space has only 5 words to select. Needless to say, the address pointer 400
can implement any number of words n. This output signal Wn can also be called the word
selection signal. When this word selection signal reaches the output of the last flip-flop in this
address pointer, it is called an OUT signal to be propagated to the inputs of the address pointers
of the next FPGA chip.

When the INITIALIZE signal is asserted, the address pointer is initialized. The first
flip-flop 401 is set to "1" and all other flip-flops 402-405 are set to "0." At this point, the
initialization of the address pointer will not enable any word selection; that is, all the Wn outputs
are still at "0" after initialization. The address pointer initialization procedure will also be
discussed with respect to FIG. 12.

The MOVE signal controls the advance of the pointer for word selection. This MOVE
signal is derived from the READ, WRITE, and SPACE index control signals from the FPGA I/O
controller. Because every operation is essentially a read or a write, the SPACE index signal
essentially determines which address pointer will be applied with the MOVE signal. Thus, the
system activates only one address pointer associated with a selected I/O address space at a time,
and during that time, the system applies the MOVE signal only to that address pointer. The
MOVE signal generation is discussed further with respect to FIG. 13. Referring to FIG. 11,
when the MOVE signal is asserted, the MOVE signal is provided to an input to an AND gate 406
and the enable input of the flip-flops 401-405. Hence, a logic "1" will move from the word
output Wi to Wi+1 every system clock cycle; that is, the pointer will move from Wi to Wi+1 to
select the particular word every cycle. When the shifting word selection signal makes its way to
the output 413 (labeled herein as "OUT") of the last flip-flop 405, this OUT signal should
thereafter make its way to the next FPGA chip via a multiplexed cross chip address pointer chain,
which will be discussed with respect to FIGS. 14 and 15, unless the address pointer is being
initialized again.
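
The shifting word selection can be modeled behaviorally as a one-hot token moving through a shift register. The sketch below is a simplified illustration under stated assumptions and does not reproduce the gate-level wiring of FIG. 11: it assumes a word is selected only in a cycle where MOVE is asserted, and it reduces the cross chip chain of FIGS. 14 and 15 to handing the token to the next chip's pointer. The names `AddressPointer` and `stream_words` are hypothetical.

```python
class AddressPointer:
    """Behavioral model of an n-word address pointer (one-hot shift register)."""

    def __init__(self, n_words):
        if n_words < 1:
            raise ValueError("an address pointer needs at least one word")
        self.ff = [0] * n_words  # one flip-flop per selectable word

    def initialize(self):
        # INITIALIZE: first flip-flop holds 1, all others 0; no word is selected yet.
        self.ff = [1] + [0] * (len(self.ff) - 1)

    def clock(self, move):
        """Advance one clock cycle; return (selected_word, out)."""
        if not move or 1 not in self.ff:
            return None, False            # idle: nothing selected, no OUT
        i = self.ff.index(1)
        self.ff[i] = 0
        is_last = i == len(self.ff) - 1
        if not is_last:
            self.ff[i + 1] = 1            # the logic 1 shifts to the next stage
        return i, is_last                 # OUT is raised after the last word


def stream_words(word_counts):
    """Order in which (chip, word) pairs are selected across chained chips."""
    pointers = [AddressPointer(n) for n in word_counts]
    pointers[0].initialize()
    order = []
    for chip, pointer in enumerate(pointers):
        out = False
        while not out:
            word, out = pointer.clock(move=True)
            order.append((chip, word))
        if chip + 1 < len(pointers):
            # Stand-in for the OUT signal arriving at the next chip's pointer.
            pointers[chip + 1].initialize()
    return order
```

For two chips with 2 and 3 words, `stream_words([2, 3])` visits every word of the first chip before any word of the second, one word per clock cycle.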

The address pointer initialization procedure will now be discussed. FIG. 12 shows a state 
transition diagram of the address pointer initialization for the address pointer of FIG. 11. 
Initially, state 460 is idle. When DATA_XSFR is set to "1," the system goes to state 461,
where the address pointer is initialized. Here, the INITIALIZE signal is asserted. The first
flip-flop in each address pointer is set to "1" and all other flip-flops in the address pointer are
set to "0." At this point, the initialization of the address pointer will not enable any word
selection; that is, all the Wn outputs are still at "0." The next state is wait state 462, in which
the system remains while DATA_XSFR is still "1." When DATA_XSFR is "0," the address pointer
initialization procedure has completed and the system returns to the idle state 460.
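
The shifting and initialization behavior described above can be illustrated with a short Python sketch. It is not taken from the specification: the class name `AddressPointer`, the method names, and the assumption that a single seed flip-flop precedes the word flip-flops W0 to Wn-1 are all hypothetical, chosen only to reproduce the stated behavior (no word selected after INITIALIZE, one word selected per MOVE cycle).

```python
class AddressPointer:
    """Illustrative one-hot shift-register model of an address pointer."""

    def __init__(self, n_words):
        # Assumption: one seed flip-flop followed by n_words word flip-flops.
        self.seed = 0
        self.words = [0] * n_words

    def initialize(self):
        # INITIALIZE: the first flip-flop is set to 1 and all others to 0,
        # so none of the word outputs W0..Wn-1 is asserted yet.
        self.seed = 1
        self.words = [0] * len(self.words)

    def clock(self, move):
        # The flip-flops are enabled only while MOVE is asserted;
        # otherwise the pointer holds its position.
        if not move:
            return
        self.words = [self.seed] + self.words[:-1]
        self.seed = 0

    @property
    def out(self):
        # OUT is the output of the last flip-flop in the pointer.
        return self.words[-1]


pointer = AddressPointer(3)
pointer.initialize()
pointer.clock(move=1)   # selects W0
pointer.clock(move=0)   # MOVE deasserted: pointer holds at W0
pointer.clock(move=1)   # selects W1
```

In this sketch the pointer advances only on cycles where MOVE is asserted, which mirrors the use of MOVE as the flip-flop enable in FIG. 11.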

The MOVE signal generator for generating the various MOVE signals for the address 
pointer will now be discussed. The SPACE index, which is generated by the FPGA I/O 
controller (item 327 in FIG. 10; FIG. 22), selects the particular address space (i.e., REG read,
REG write, S2H read, H2S write, and CLK write). Within this address space, the system of the
present invention sequentially selects the particular word to be accessed. The sequential word 
selection is accomplished in each address pointer by the MOVE signal. 

One embodiment of the MOVE signal generator is shown in FIG. 13. Each FPGA chip 
450 has address pointers that correspond to the various software/hardware boundary address
spaces (i.e., REG, S2H, H2S, and CLK). In addition to the address pointer and the user's circuit
design that is modeled and implemented in FPGA chip 450, the MOVE signal generator 470 is
provided in the FPGA chip 450. The MOVE signal generator 470 includes an address space
decoder 451 and several AND gates 452-456. The input signals are the FPGA read signal
(F_RD) on wire line 457, FPGA write signal (F_WR) on wire line 458, and the address space
signal 459. The output MOVE signal for each address pointer corresponds to REGR-move on
wire line 464, REGW-move on wire line 465, S2H-move on wire line 466, H2S-move on wire 
line 467, and CLK-move on wire line 468, depending on which address space's address pointer is 
applicable. These output signals correspond to the MOVE signal on wire line 408 (FIG. 11). 

The address space decoder 451 receives a 3-bit input signal 459. It can also receive just a
2-bit input signal. The 2-bit signal provides for 4 possible address spaces, whereas the 3-bit
input provides for 8 possible address spaces. In one embodiment, CLK is assigned to "00," S2H
is assigned to "01," H2S is assigned to "10," and REG is assigned to "11." Depending on the
input signal 459, the address space decoder outputs a "1" on one of the wire lines
460-463, corresponding to REG, H2S, S2H, and CLK, respectively, while the remaining wire
lines are set to "0." Thus, if any of these output wire lines 460-463 is "0," the corresponding
output of the AND gates 452-456 is "0." Analogously, if any of these input wire lines 460-463
is "1," the corresponding output of the AND gates 452-456 is "1," provided the associated
F_RD or F_WR signal is also asserted. For example, if the address
space signal 459 is "10," then the address space H2S is selected. Wire line 461 is "1" while the
remaining wire lines 460, 462, and 463 are "0." Accordingly, wire line 466 is "1," while the
remaining output wire lines 464, 465, 467, and 468 are "0." Similarly, if wire line 460 is "1,"
the REG space is selected and depending on whether a read (F_RD) or write (F_WR) operation
is selected, either the REGR-move signal on wire line 464 or the REGW-move signal on wire
line 465 will be "1."

As explained earlier, the SPACE index is generated by the FPGA I/O controller. In code,
the MOVE controls are:

    REG space read pointer:   REGR-move = (SPACE-index == #REG) & READ;
    REG space write pointer:  REGW-move = (SPACE-index == #REG) & WRITE;
    S2H space read pointer:   S2H-move  = (SPACE-index == #S2H) & READ;
    H2S space write pointer:  H2S-move  = (SPACE-index == #H2S) & WRITE;
    CLK space write pointer:  CLK-move  = (SPACE-index == #CLK) & WRITE;

This is the code equivalent for the logic diagram of the MOVE signal generator on FIG. 13.
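
As a hedged illustration only, the MOVE controls above can be expressed as a small Python function. The 2-bit encodings follow the embodiment described earlier (CLK "00," S2H "01," H2S "10," REG "11"); the names `Space` and `move_signals` and the dictionary return type are assumptions made for this sketch and do not appear in the specification.

```python
from enum import Enum


class Space(Enum):
    # 2-bit address space encodings from the described embodiment.
    CLK = 0b00
    S2H = 0b01
    H2S = 0b10
    REG = 0b11


def move_signals(space_index, read, write):
    """Decode the SPACE index and gate it with READ/WRITE (the AND gates)."""
    return {
        "REGR-move": space_index is Space.REG and bool(read),
        "REGW-move": space_index is Space.REG and bool(write),
        "S2H-move": space_index is Space.S2H and bool(read),
        "H2S-move": space_index is Space.H2S and bool(write),
        "CLK-move": space_index is Space.CLK and bool(write),
    }


# A write into the H2S space asserts only H2S-move.
signals = move_signals(Space.H2S, read=False, write=True)
active = [name for name, value in signals.items() if value]
```

Because the decoder asserts exactly one space at a time, at most one MOVE output is active for a pure read or a pure write, which is the property the address pointers rely on.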

As mentioned above, each FPGA chip has the same number of address pointers as address
spaces in the software/hardware boundary. If the software/hardware boundary has 4 address
spaces (i.e., REG, S2H, H2S, and CLK), each FPGA chip has 4 address pointers corresponding
to these 4 address spaces. Each FPGA needs these 4 address pointers because the particular
selected word in the selected address space being processed may reside in any one or more of the
FPGA chips, or the data in the selected address space affects the various circuit elements modeled
and implemented in each FPGA chip. To ensure that the selected word is processed with the
appropriate circuit element(s) in the appropriate FPGA chip(s), each set of address pointers
associated with a given software/hardware boundary address space (i.e., REG, S2H, H2S, and
CLK) is "chained" together across several FPGA chips. The particular shifting or propagating
word selection mechanism via the MOVE signals, as explained above with respect to FIG. 11, is
still utilized, except that in this "chain" embodiment, an address pointer associated with a
particular address space in one FPGA chip is "chained" to an address pointer associated with the
same address space in the next FPGA chip.

Implementing 4 input pins and 4 output pins to chain the address pointers would
accomplish the same purpose. However, this implementation would be too costly in terms of
efficient use of resources; that is, 4 wires would be needed between two chips, and 4 input pins
and 4 output pins would be needed in each chip. One embodiment of the system in accordance
with the present invention uses a multiplexed cross chip address pointer chain which allows the
hardware model to use only one wire between chips and only 1 input pin and 1 output pin in each
chip (2 I/O pins in a chip). One embodiment of the multiplexed cross chip address pointer chain
is shown in FIG. 14.

In the embodiment shown in FIG. 14, the user's circuit design had been mapped and
partitioned in three FPGA chips 415-417 in the reconfigurable hardware board 470. The address
pointers are shown as blocks 421-432. Each address pointer, for example address pointer 427,
has a structure and function similar to the address pointer shown in FIG. 11, except that the
number of words Wn and hence the number of flip-flops may vary depending on how many
words are implemented in each chip for the user's custom circuit design.

For the REGR address space, the FPGA chip 415 has address pointer 421, FPGA chip
416 has address pointer 425, and FPGA chip 417 has address pointer 429. For the REGW
address space, the FPGA chip 415 has address pointer 422, FPGA chip 416 has address pointer
426, and FPGA chip 417 has address pointer 430. For the S2H address space, the FPGA chip
415 has address pointer 423, FPGA chip 416 has address pointer 427, and FPGA chip 417 has
address pointer 431. For the H2S address space, the FPGA chip 415 has address pointer 424,
FPGA chip 416 has address pointer 428, and FPGA chip 417 has address pointer 432.

Each chip 415-417 has a multiplexer 418-420, respectively. Note that these multiplexers
418-420 may be models and the actual implementation may be a combination of registers and
logic elements, as known to those ordinarily skilled in the art. For example, the multiplexer may
be several AND gates feeding into an OR gate as shown in FIG. 15. The multiplexer 487
includes four AND gates 481-484 and an OR gate 485. The inputs to the multiplexer 487 are the
OUT and MOVE signals from each address pointer in the chip. The output 486 of the
multiplexer 487 is a chain-out signal which is passed to the inputs to the next FPGA chip.

In FIG. 15, this particular FPGA chip has four address pointers 475-478, corresponding
to I/O address spaces. The outputs of the address pointers, the OUT and MOVE signals, are
inputs to the multiplexer 487. For example, address pointer 475 has an OUT signal on wire line
479 and a MOVE signal on wire line 480. These signals are inputs to AND gate 481. The
output of this AND gate 481 is an input to OR gate 485. The output of the OR gate 485 is the
output of this multiplexer 487. In operation, the OUT signal at the output of each address pointer
475-478 in combination with its corresponding MOVE signal and the SPACE index serves as a
selector signal for the multiplexer 487; that is, both the OUT and MOVE signals (which are
derived from the SPACE index signals) have to be asserted active (e.g., logic "1") to propagate
the word selection signal out of the multiplexer to the chain-out wire line. The MOVE signal
will be asserted periodically to move the word selection signal through the flip-flops in the
address pointer so that it can be characterized as the input MUX data signal.

Returning to FIG. 14, these multiplexers 418-420 have four sets of inputs and one output.
Each set of inputs includes: (1) the OUT signal found on the last output Wn-1 wire line for the
address pointer (e.g., wire line 413 in the address pointer shown in FIG. 11) associated with a
particular address space, and (2) the MOVE signal. The output of each multiplexer 418-420 is
the chain-out signal. The word selection signal Wn through the flip-flops in each address pointer
becomes the OUT signal when it reaches the output of the last flip-flop in the address pointer.
The chain-out signal on wire lines 433-435 will become "1" only when an OUT signal and a
MOVE signal associated with the same address pointer are both asserted active (e.g., asserted
"1").

For multiplexer 418, the inputs are MOVE signals 436-439 and OUT signals 440-443 
corresponding to OUT and MOVE signals from address pointers 421-424, respectively. For 
multiplexer 419, the inputs are MOVE signals 444-447 and OUT signals 452-455 corresponding
to OUT and MOVE signals from address pointers 425-428, respectively. For multiplexer 420, 
the inputs are MOVE signals 448-451 and OUT signals 456-459 corresponding to OUT and 
MOVE signals from address pointers 429-432, respectively. 

In operation, for any given shift of words Wn, only those address pointers or chain of
address pointers associated with a selected I/O address space in the software/hardware boundary
are active. Thus, in FIG. 14, only the address pointers in chips 415, 416, and 417 associated
with one of the address spaces REGR, REGW, S2H, or H2S are active for a given shift. Also,
for a given shift of the word selection signal Wn through the flip-flops, the selected word is
accessed sequentially because of limitations on the bus bandwidth. In one embodiment, the bus is
32 bits wide and a word is 32 bits, so only one word can be accessed at a time and delivered to
the appropriate resource.

When an address pointer is in the middle of propagating or shifting the word selection
signal through its flip-flops, the output chain-out signal is not activated (e.g., not "1") and thus,
this multiplexer in this chip is not yet ready to propagate the word selection signal to the next
FPGA chip. When the OUT signal is asserted active (e.g., "1"), the chain-out signal is asserted
active (e.g., "1") indicating that the system is ready to propagate or shift the word selection signal
to the next FPGA chip. Thus, accesses occur one chip at a time; that is, the word selection
signal is shifted through the flip-flops in one chip before the word selection shift operation is
performed for another chip. Indeed, the chain-out signal is asserted only when the word selection
signal reaches the end of the address pointer in each chip. In code, the chain-out signal is:

    Chain-out = (REGR-move & REGR-out) | (REGW-move & REGW-out) |
                (S2H-move & S2H-out) | (H2S-move & H2S-out);

In sum, for X number of I/O address spaces (i.e., REG, H2S, S2H, CLK) in the system,
each FPGA has X address pointers, one address pointer for each address space. The size of each
address pointer depends on the number of words required for modeling the user's custom circuit
design in each FPGA chip. Assuming n words for a particular FPGA chip and hence, n words
for the address pointer, this particular address pointer has n outputs (i.e., W0, W1, W2, . . . ,
Wn-1). These outputs Wi are also called word selection signals. When a particular word Wi is
selected, the Wi signal is asserted active (i.e., "1"). This word selection signal shifts or
propagates down the address pointer of this chip until it reaches the end of the address pointer in
this chip, at which point, it triggers the generation of a chain-out signal that starts the propagation
of the word selection signal Wi through the address pointer in the next chip. In this way, a chain
of address pointers associated with a given I/O address space can be implemented across all of the
FPGA chips in this reconfigurable hardware board.
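
The cross chip chaining summarized above can be sketched as a small cycle-by-cycle simulation for a single address space. This is an illustrative model only: the function name `run_chain`, the list-based representation of each chip's address pointer, the assumption that MOVE stays asserted on every cycle, and the assumption that the chain-out of one chip is captured by the first word flip-flop of the next chip on the following clock edge are not stated in the specification. Each chip is assumed to hold at least one word.

```python
def run_chain(words_per_chip, cycles):
    """Return, per MOVE cycle, the (chip, word) positions that are selected."""
    chips = [[0] * n for n in words_per_chip]
    seed = 1  # assumption: INITIALIZE leaves a single 1 ahead of W0 of chip 0
    selected = []
    for _ in range(cycles):
        move = 1  # MOVE for the selected address space is held asserted
        # Chain-out of each chip = MOVE & OUT, sampled before the clock edge.
        chain_out = [move & chip[-1] for chip in chips]
        # Chip k shifts in the chain-out of chip k-1; chip 0 shifts in the seed.
        chain_in = [seed] + chain_out[:-1]
        chips = [[cin] + chip[:-1] for cin, chip in zip(chain_in, chips)]
        seed = 0
        selected.append(
            [(c, w) for c, chip in enumerate(chips)
             for w, bit in enumerate(chip) if bit]
        )
    return selected


# Three chips holding 2, 1, and 2 words of the same address space.
trace = run_chain([2, 1, 2], cycles=5)
```

Under these assumptions exactly one word is selected per cycle, and the selection walks through all words of one chip before entering the next, using a single chain wire between neighboring chips.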

C. GATED DATA/CLOCK NETWORK ANALYSIS

The various embodiments of the present invention perform clock analysis in association
with gated data logic and gated clock logic analysis. The gated clock logic (or clock network) and
the gated data network determinations are critical to the successful implementation of the software
clock and the logic evaluation in the hardware model during emulation. As discussed with
respect to FIG. 4, the clock analysis is performed in step 305. To further elaborate on this clock
analysis process, FIG. 16 shows a flow diagram in accordance with one embodiment of the
present invention. FIG. 16 also shows the gated data analysis.

The SEmulation system has the complete model of the user's circuit design in software
and some portions of the user's circuit design in hardware. These hardware portions include the
clock components, especially the derived clocks. Clock delivery timing issues arise due to this
boundary between software and hardware. Because the complete model is in software, the
software can detect clock edges that affect register values. In addition to the software model of
the registers, these registers are physically located in the hardware model. To ensure that the
hardware registers also evaluate their respective inputs (i.e., moving the data at the D input to the
Q output), the software/hardware boundary includes a software clock. The software clock
ensures that the registers in the hardware model evaluate correctly. The software clock
essentially controls the enable input of the hardware register rather than controlling the clock
input to the hardware register components. This software clock avoids race conditions and
accordingly, precise timing control to avoid hold-time violations is not needed. The clock
network and gated data logic analysis process shown in FIG. 16 provides a way of modeling and
implementing the clock and data delivery system to the hardware registers such that race
conditions are avoided and a flexible software/hardware boundary implementation is provided.

As discussed earlier, primary clocks are clock signals from test-bench processes. All
other clocks, such as those clock signals derived from combinational components, are derived or
gated clocks. A primary clock can derive both gated clocks and gated data signals. For the most
part, only a few (e.g., 1-10) derived or gated clocks are in the user's circuit design. These
derived clocks can be implemented as software clocks and will stay in software. If a relatively
large number (e.g., more than 10) of derived clocks are present in the circuit design, the
SEmulation system will model them into hardware to reduce I/O overhead and maintain the
SEmulation system's performance. Gated data is data or control input of a register other than the
clock driven from the primary clock through some combinational logic.

The gated data/clock analysis process starts at step 500. Step 501 takes the usable source
design database code generated from the HDL code and maps the user's register elements to the
SEmulation system's register components. This one-to-one mapping of user registers to
SEmulation registers facilitates later modeling steps. In some cases, this mapping is necessary to
handle user circuit designs which describe register elements with specific primitives. Thus, for
RTL level code, SEmulation registers can be used quite readily because the RTL level code is at
a high enough level, allowing for varying lower level implementations. For gate level netlist, the
SEmulation system will access the cell library of components and modify them to suit the
particular circuit design-specific logic elements.

Step 502 extracts clock signals out of the hardware model's register components. This
step allows the system to determine primary clocks and derived clocks. This step also determines
all the clock signals needed by various components in the circuit design. The information from
this step facilitates the software/hardware clock modeling step.

Step 503 determines primary clocks and derived clocks. Primary clocks originate from
test-bench components and are modeled in software only. Derived clocks are derived from
combinational logic, which are in turn driven by primary clocks. By default, the SEmulation
system of the present invention will keep the derived clocks in software. If the number of
derived clocks is small (e.g., less than 10), then these derived clocks can be modeled as software
clocks. The number of combinational components to generate these derived clocks is small, so
significant I/O overhead is not added by keeping these combinational components residing in
software. If, however, the number of derived clocks is large (e.g., more than 10), these derived
clocks may be modeled in hardware to minimize I/O overhead. Sometimes, the user's circuit
design uses a large number of derived clock components derived from primary clocks. The
system thus builds the clocks in hardware to keep the number of software clocks small.

Decision step 504 requires the system to determine if any derived clocks are found in the 
user's circuit design. If not, step 504 resolves to "NO" and the clock analysis ends at step 508 
because all the clocks in the user's circuit design are primary clocks and these clocks are simply
modeled in software. If derived clocks are found in the user's circuit design, step 504 resolves to 
"YES" and the algorithm proceeds to step 505. 

Step 505 determines the fan-out combinational components from the primary clocks to the
derived clocks. In other words, this step traces the clock signal datapaths from the primary
clocks through the combinational components. Step 506 determines the fan-in combinational
components from the derived clocks. In other words, this step traces the clock signal datapaths
from the combinational components to the derived clocks. Determining fan-out and fan-in sets in
the system is done recursively in software. The fan-in set of a net N is as follows:

FanInSet of a net N:
    find all the components driving net N;
    for each component X driving net N do:
        if the component X is not a combinational component then
            return;
        else
            for each input net Y of the component X
                add the FanIn Set W of net Y to the FanIn Set of net N
            end for
            add the component X into N;
        end if
    end for

A gated clock or data logic network is determined by recursively determining the fan-in
set and fan-out set of net N, and determining their intersection. The ultimate goal here is to
determine the so-called Fan-In Set of net N. The net N is typically a clock input node for
determining the gated clock logic from a fan-in perspective. For determining the gated data logic
from a fan-in perspective, net N is a clock input node associated with the data input at hand. If
the node is on a register, the net N is the clock input to that register for the data input associated
with that register. The system finds all the components driving net N. For each component X
driving net N, the system determines if the component X is a combinational component or not. If
each component X is not a combinational component, then the fan-in set of net N has no
combinational components and net N is a primary clock.

If, however, at least one component X is a combinational component, the system then
determines the input net Y of the component X. Here, the system is looking further back in the
circuit design by finding the input nodes to the component X. For each input net Y of each
component X, a fan-in set W may exist which is coupled to net Y. This fan-in set W of net Y is
added to the Fan-In Set of net N, then the component X is added into set N.

The fan-out set of a net N is determined in a similar manner. The fan-out set of net N is
determined as follows:



FanOutSet of a net N:
    find all the components using the net N;
    for each component X using the net N do:
        if the component X is not a combinational component then
            return;
        else
            for each output net Y of component X
                add the FanOut Set of net Y to the FanOut Set of net N
            end for
            add the component X into N;
        end if
    end for

Again, the gated clock or data logic network is determined by recursively determining the
fan-in set and fan-out set of net N, and determining their intersection. The ultimate goal here is
to determine the so-called Fan-Out Set of net N. The net N is typically a clock output node for
determining the gated clock logic from a fan-out perspective. Thus, the set of all logic elements
using net N will be determined. For determining the gated data logic from a fan-out perspective,
net N is a clock output node associated with the data output at hand. If the node is on a register,
the net N is the output of that register for the primary clock-driven input associated with that
register. The system finds all the components using net N. For each component X using net N,
the system determines if the component X is a combinational component or not. If each
component X is not a combinational component, then the fan-out set of net N has no
combinational components and net N is a primary clock.

If, however, at least one component X is a combinational component, the system then
determines the output net Y of the component X. Here, the system is looking further forward
from the primary clock in the circuit design by finding the output nodes from the component X.
For each output net Y from each component X, a fan-out set W may exist which is coupled to net
Y. This fan-out set W of net Y is added to the Fan-Out Set of net N, then the component X is
added into set N.

Step 507 determines the clock network or gated clock logic. The clock network is the
intersection of the fan-in and fan-out combinational components.
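
The fan-in, fan-out, and intersection steps (steps 505-507) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: the `Component` record, the net names, and the helper names `fan_in`, `fan_out`, and `gated_logic` are hypothetical. Two deliberate departures from the pseudocode are also assumptions of this sketch: tracing simply stops at a non-combinational component instead of returning from the whole procedure, and a `seen` set guards against combinational loops.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    combinational: bool
    inputs: tuple    # names of nets read by the component
    outputs: tuple   # names of nets driven by the component


def fan_in(net, comps, seen=None):
    """Combinational components found by tracing backward from `net`."""
    seen = set() if seen is None else seen
    result = set()
    for c in comps:
        if net in c.outputs and c.combinational and c.name not in seen:
            seen.add(c.name)
            result.add(c.name)
            for y in c.inputs:
                result |= fan_in(y, comps, seen)
    return result


def fan_out(net, comps, seen=None):
    """Combinational components found by tracing forward from `net`."""
    seen = set() if seen is None else seen
    result = set()
    for c in comps:
        if net in c.inputs and c.combinational and c.name not in seen:
            seen.add(c.name)
            result.add(c.name)
            for y in c.outputs:
                result |= fan_out(y, comps, seen)
    return result


def gated_logic(primary_net, target_net, comps):
    # Gated network = fan-out of the primary clock AND fan-in of the target.
    return fan_out(primary_net, comps) & fan_in(target_net, comps)


netlist = [
    Component("g1", True, ("clk", "en"), ("gclk",)),   # clock gate
    Component("g3", True, ("clk",), ("x",)),           # unrelated buffer
    Component("r1", False, ("gclk", "d"), ("q",)),     # register
]
clock_network = gated_logic("clk", "gclk", netlist)
```

Here `g3` lies in the fan-out of `clk` but not in the fan-in of `gclk`, so only `g1` remains in the intersection, matching the definition of the gated clock logic given above.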

Analogously, the same fan-in and fan-out principle can be used to determine the gated data
logic. Like the gated clocks, gated data is the data or control input of a register (except for the
clock) driven by a primary clock through some combinational logic. Gated data logic is the
intersection of the fan-in of the gated data and fan-out from the primary clock. Thus, the clock
analysis and gated data analysis result in a gated clock network/logic through some combinational
logic and a gated data logic. As described later, the gated clock network and the gated data
network determinations are critical to the successful implementation of the software clock and the
logic evaluation in the hardware model during emulation. The clock/data network analysis ends
at step 508.

FIG. 17 shows a basic building block of the hardware model in accordance with one
embodiment of the present invention. For the register component, the SEmulation system uses a
D-type flip-flop with asynchronous load control as the basic block for building both edge trigger
(i.e., flip-flops) and level sensitive (i.e., latches) register hardware models. This register model
building block has the following ports: Q (the output state); A_E (asynchronous enable); A_D
(asynchronous data); S_E (synchronous enable); S_D (synchronous data); and of course,
System.clk (system clock).

This SEmulation register model is triggered by a positive edge of the system clock or a
positive level of the asynchronous enable (A_E) input. When either of these two positive edge or
positive level triggering events occurs, the register model looks for the asynchronous enable
(A_E) input. If the asynchronous enable (A_E) input is enabled, the output Q takes on the value
of the asynchronous data (A_D); otherwise, if the synchronous enable (S_E) input is enabled, the
output Q takes on the value of the synchronous data (S_D). If, on the other hand, neither the
asynchronous enable (A_E) nor the synchronous enable (S_E) input is enabled, the output Q is
not evaluated despite the detection of a positive edge of the system clock. In this way, the inputs
to these enable ports control the operation of this basic building block register model.

The system uses software clocks, which are special enable registers, to control the enable
inputs of these register models. In a complex user circuit design, millions of elements are found
in the circuit design and accordingly, the SEmulator system will implement millions of elements
in the hardware model. Controlling all of these elements individually is costly because the
overhead of sending millions of control signals to the hardware model will take a longer time than
evaluating these elements in software. However, even this complex circuit design usually calls
for only a few (from 1-10) clocks and clocks alone are sufficient to control the state changes of a
system with register and combinational components only. The hardware model of the SEmulator
system uses only register and combinational components. The SEmulator system also controls
the evaluation of the hardware model through software clocks. In the SEmulator system, the
hardware models for registers do not have the clock directly connected to other hardware
components; rather, the software kernel controls the value of all clocks. By controlling a few
clock signals, the kernel has full control over the evaluation of the hardware models with
a negligible amount of coprocessor intervention overhead.

Depending on whether the register model is used as a latch or a flip-flop, the software
clock will be input to either the asynchronous enable (A_E) or synchronous enable (S_E) wire
lines. The application of the software clock from the software model to the hardware model is
triggered by edge detection of clock components. When the software kernel detects the edge of
clock components, it sets the clock-edge register through the CLK address space. This
clock-edge register controls the enable input, not the clock input, to the hardware register model. The
global system clock still provides the clock input to the hardware register model. However, the
clock-edge register provides the software clock signal to the hardware register model through a
double-buffered interface. As will be explained later, a double-buffer interface from the software
clock to the hardware model ensures that all the register models will be updated synchronously
with respect to the global system clock. Thus, the use of the software clock eliminates the risk of
hold time violations.
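
The enable-driven behavior of the register building block described with respect to FIG. 17 can be sketched in Python as shown below. The class name `RegisterModel` and the `evaluate` method are hypothetical; the sketch assumes that `evaluate` is called once for each triggering event (a positive edge of the system clock or a positive level on A_E) and models only the priority among the enable inputs, not timing or the double-buffered interface.

```python
class RegisterModel:
    """Illustrative D-type register with asynchronous load control."""

    def __init__(self, q=0):
        self.q = q

    def evaluate(self, a_e, a_d, s_e, s_d):
        # Assumed to be called on a positive system-clock edge
        # or while A_E is at a positive level.
        if a_e:
            self.q = a_d      # asynchronous enable has priority
        elif s_e:
            self.q = s_d      # otherwise use the synchronous path
        # With neither enable asserted, Q is not evaluated and holds.
        return self.q


# Used as a flip-flop: the software clock drives S_E, next-state logic drives S_D.
reg = RegisterModel()
reg.evaluate(a_e=0, a_d=0, s_e=0, s_d=1)   # no software clock: Q holds 0
reg.evaluate(a_e=0, a_d=0, s_e=1, s_d=1)   # software clock asserted: Q becomes 1
```

The point of the sketch is that the system clock edge alone never changes Q; only an asserted enable does, which is why the software clock can control evaluation through the enable ports.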

FIGS. 18(A) and 18(B) show the implementation of the building block register model for
latches and flip-flops. These register models are software clock-controlled via the appropriate
enable inputs. Depending on whether the register model is used as a flip-flop or latch, the
asynchronous ports (A_E, A_D) and synchronous ports (S_E, S_D) are either used for the
software clock or I/O operations. FIG. 18(A) shows the register model implementation if it is
used as a latch. Latches are level-sensitive; that is, so long as the clock signal has been asserted
(e.g., "1"), the output Q follows the input (D). Here, the software clock signal is provided to
the asynchronous enable (A_E) input and the data input is provided to the asynchronous data
(A_D) input. For I/O operations, the software kernel uses the synchronous enable (S_E) and
synchronous data (S_D) inputs to download values into the Q port. The S_E port is used as a
REG space address pointer and the S_D is used to access data to/from the local data bus.

FIG. 18(B) shows the register model implementation if it is used as a design flip-flop.
Design flip-flops use the following ports for determining the next state logic: data (D), set (S),
reset (R), and enable (E). All the next state logic of a design flip-flop is factored into a hardware
combinational component which feeds into the synchronous data (S_D) input. The software clock
is input to the synchronous enable (S_E) input. For I/O operations, the software kernel uses the
asynchronous enable (A_E) and asynchronous data (A_D) inputs to download values into the Q
port. The A_E port is used as a REG space write address pointer and the A_D port is used to
access data to/from the local data bus.

The software clock will now be discussed. One embodiment of the software clock of the
present invention is a clock enable signal to the hardware register model such that the data at the
inputs to these hardware register models are evaluated together and synchronously with the
system clock. This eliminates race conditions and hold-time violations. One implementation of
the software clock logic includes clock edge detection logic in software which triggers additional
logic in the hardware upon clock edge detection. Such enable signal logic generates an enable
signal to the enable inputs to hardware register models before the arrival of the data to these
hardware register models. The gated clock network and the gated data network determinations
are critical to the successful implementation of the software clock and the logic evaluation in the
hardware model during hardware acceleration mode. As explained earlier, the clock network or
gated clock logic is the intersection of the fan-in of the gated clock and fan-out of the primary
clock. Analogously, the gated data logic is also the intersection of the fan-in of the gated data
and fan-out of the primary clock for the data signals. These fan-in and fan-out concepts are
discussed above with respect to FIG. 16.

As discussed earlier, primary clocks are generated by test-bench processes in software.
Derived or gated clocks are generated from a network of combinational logic and registers which
are in turn driven by the primary clocks. By default, the SEmulation system of the present
invention will also keep the derived clocks in software. If the number of derived clocks is small
(e.g., less than 10), then these derived clocks can be modeled as software clocks. The number of
combinational components to generate these derived clocks is small, so significant I/O overhead
is not added by modeling these combinational components in software. If, however, the number
of derived clocks is large (e.g., more than 10), these derived clocks and their combinational
components may be modeled in hardware to minimize I/O overhead.

Ultimately, in accordance with one embodiment of the present invention, clock edge
detection occurring in software (via the input to the primary clock) can be translated to clock
edge detection in hardware (via the input to a clock edge register). The clock edge detection in
software triggers an event in hardware so that the registers in the hardware model receive the
clock enable signal before the data signal to ensure that the evaluation of the data signal occurs in
synchronization with the system clock to avoid hold-time violations.

As stated earlier, the SEmulation system has the complete model of the user's circuit 
design in software and some portions of the user's circuit design in hardware. As specified in the 
kernel, the software can detect clock edges that affect hardware register values. To ensure that
the hardware registers also evaluate their respective inputs, the software/hardware boundary
includes a software clock. The software clock ensures that the registers in the hardware model
evaluate in synchronization with the system clock and without any hold-time violations. The
software clock essentially controls the enable input of the hardware register components, rather
than controlling the clock input to the hardware register components. The double-buffered
approach to implementing the software clocks ensures that the registers evaluate in
synchronization with the system clock to avoid race conditions and eliminates the need for precise
timing controls to avoid hold-time violations.

FIG. 19 shows one embodiment of the clock implementation system in accordance with
the present invention. Initially, the gated clock logic and the gated data logic are determined by
the SEmulator system, as discussed above with respect to FIG. 16. The gated clock logic and the
gated data logic are then separated. When implementing the double buffer, the driving source
and the double-buffered primary logic must also be separated. Accordingly, the gated data logic
513 and gated clock logic 514, from the fan-in and fan-out analysis, have been separated.

The modeled primary clock register 510 includes a first buffer 511 and a second buffer
512, which are both D registers. This primary clock is modeled in software but the double-buffer
implementation is modeled in both software and hardware. Clock edge detection occurs in the
primary clock register 510 in software to trigger the hardware model to generate the software
clock signal to the hardware model. Data and address enter the first buffer 511 at wire lines 519
and 520, respectively. The Q output of this first buffer 511 on wire line 521 is coupled to the D
input of second buffer 512. The Q output of this first buffer 511 is also provided on wire line
522 to the gated clock logic 514 to eventually drive the clock input of the first buffer 516 of the
clock edge register 515. The Q output of the second buffer 512 on wire line 523 is provided to
the gated data logic 513 to eventually drive the input of register 518 via wire line 530 in the
user's custom-designed circuit model. The enable input to the second buffer 512 in the primary
clock register 510 is the INPUT-EN signal on wire line 533 from a state machine, which
determines evaluation cycles and controls various signals accordingly.

The clock edge register 515 also includes a first buffer 516 and a second buffer 517. The
clock edge register 515 is implemented in hardware. When a clock edge detection occurs in
software (via the input to the primary clock register 510), this can trigger the same clock edge
detection in hardware (via clock edge register 515). The D input to the first buffer
516 on wire line 524 is set to logic "1." The clock signal on wire line 525 is derived from the
gated clock logic 514 and ultimately from the primary clock register 510 at the output on wire
line 522 of the first buffer 511. This clock signal on wire line 525 is the gated clock signal. The
enable wire line 526 for the first buffer 516 is the -EVAL signal from the state machine that
controls the I/O and evaluation cycles (to be discussed later). The first buffer 516 also has a
RESET signal on wire line 527. This same RESET signal is also provided to the second buffer
517 in the clock edge register 515. The Q output of the first buffer 516 on wire line 529 is
provided to the D input to the second buffer 517. The second buffer 517 also has an enable input
on wire line 528 for the CLK-EN signal and a RESET input on wire line 527. The Q output of
the second buffer 517 on wire line 532 is provided to the enable input of the register 518 in the
user's custom-designed circuit model. Buffers 511, 512, and 517 along with register 518 are
clocked by the system clock. Only buffer 516 in the clock edge register 515 is clocked by a
gated clock from the gated clock logic 514.

Register 518 is a typical D-type register model that is modeled in hardware and is part of
the user's custom circuit design. Its evaluation is strictly controlled by this embodiment of the
clock implementation scheme of the present invention. The ultimate goal of this clock set-up is to
ensure that the clock enable signal at wire line 532 arrives at the register 518 before the data
signal at wire line 530 so that the evaluation of the data signal by this register will be
synchronized with the system clock and without race conditions.

To reiterate, the modeled primary clock register 510 is modeled in software but its double
buffer implementation is modeled in both software and hardware. The clock edge register 515 is
implemented in hardware. The gated data logic 513 and gated clock logic 514, from the fan-in
and fan-out analysis, have also been separated for modeling purposes, and can be modeled in
software (if the number of gated data and gated clocks is small) or hardware (if the number of
gated data and gated clocks is large). The gated clock network and the gated data network
determinations are critical to the successful implementation of the software clock and the logic
evaluation in the hardware model during hardware acceleration mode.

The software clock implementation relies primarily on the clock set-up shown in FIG. 19
along with the timing of the assertions of signals -EVAL, INPUT-EN, CLK-EN, and RESET.
The primary clock register 510 detects clock edges to trigger the software clock generation for
the hardware model. This clock edge detection event triggers the "activation" of the clock edge
register 515 via the clock input on wire line 525, gated clock logic 514, and wire line 522 so that
the clock edge register 515 also detects the same clock edge. In this way, clock edge detection
occurring in software (via the inputs 519 and 520 to the primary clock register 510) can be
translated to clock edge detection in hardware (via the input 525 in clock edge register 515). At
this point, the INPUT-EN wire line 533 to second buffer 512 in the primary clock register 510
and the CLK-EN wire line 528 to second buffer 517 in the clock edge register 515 have not been
asserted and thus, no data evaluation will take place. Thus, the clock edges will be detected
before the data are evaluated in the hardware register model. Note that at this stage, the data
from the data bus on wire line 519 has not even propagated out to the gated data logic 513 and
into the hardware-modeled user register 518. Indeed, the data have not even reached the second
buffer 512 in the primary clock register 510 because the INPUT-EN signal on wire line 533 has
not been asserted yet.

During the I/O stage, the -EVAL signal on wire line 526 is asserted to enable the first 
buffer 516 in the clock edge register 515. The -EVAL signal also goes through the gated clock 
logic 514 to monitor the gated clock signal as it makes its way through the gated clock logic to 
the clock input on wire line 525 of first buffer 516. Thus, as will be explained later with respect
to the 4-state evaluation state machine, the -EVAL signal can be maintained as long as necessary 
to stabilize the data and the clock signals through that portion of the system illustrated in FIG. 
19. 

When the signal has stabilized, I/O has concluded, or the system is otherwise ready to
evaluate the data, the -EVAL signal is deasserted to disable the first buffer 516. The CLK-EN signal is
asserted and applied to second buffer 517 via wire line 528 to enable the second buffer 517 and
send the logic "1" value on wire line 529 to the Q output on wire line 532 to the enable input for
register 518. Register 518 is now enabled and any data present on wire line 530 will be
synchronously clocked into the register 518 by the system clock. As the reader can observe, the
enable signal to the register 518 runs faster than the evaluation of the data signal to this register
518.

The INPUT-EN signal on wire line 533 is now asserted to the second buffer 512. Also,
the RESET edge register signal on wire line 527 is asserted to buffers 516 and 517 in the clock
edge register 515 to reset these buffers and ensure that their outputs are logic "0." Now that
the INPUT-EN signal has been asserted for buffer 512, the data on wire line 521 propagates
through the gated data logic 513 to the user's circuit register 518 on wire line 530. Because the enable
input to this register 518 is now logic "0," the data on wire line 530 cannot be clocked into the
register 518. The previous data, however, has already been clocked in by the previously asserted
enable signal on wire line 532 before the RESET signal was asserted to disable register 518.
Thus the input data to register 518, as well as the inputs to other registers that are part of the
user's hardware-modeled circuit design, stabilize at their respective register input ports. When a
clock edge is subsequently detected in software, the primary clock register 510 and the clock
edge register 515 in hardware activate the enable input to the register 518 so that the data waiting
at the input of register 518 and other data waiting at the inputs to their respective registers are
clocked in together and synchronously by the system clock.
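
The enable-before-data ordering described above can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the hardware implementation: the classes `EnabledDFF` and `Fig19Model` are hypothetical, the gated data logic 513 is reduced to a plain wire, RESET is treated as synchronous, and the number of system clock ticks spent in each phase is chosen only for illustration.

```python
class EnabledDFF:
    """D flip-flop with enable and reset (behavioral; reset assumed synchronous)."""

    def __init__(self, q=0):
        self.q = q

    def clock(self, d, enable, reset=0):
        if reset:
            self.q = 0
        elif enable:
            self.q = d
        return self.q


class Fig19Model:
    """Simplified model of buffers 512, 516, 517 and user register 518."""

    def __init__(self, initial_data=0):
        self.buf512 = EnabledDFF(initial_data)  # second buffer of register 510
        self.buf516 = EnabledDFF()              # first buffer of register 515
        self.buf517 = EnabledDFF()              # second buffer of register 515
        self.reg518 = EnabledDFF()              # user register

    def gated_clock_edge(self, eval_asserted):
        # Buffer 516 is clocked by the gated clock; its D input is tied to 1.
        self.buf516.clock(1, enable=eval_asserted)

    def system_tick(self, data_521, clk_en=0, input_en=0, reset=0):
        # Sample all current outputs first, then update every flip-flop.
        wire_530 = self.buf512.q  # gated data logic modeled as a wire (assumption)
        wire_532 = self.buf517.q  # enable input of register 518
        wire_529 = self.buf516.q
        self.reg518.clock(wire_530, enable=wire_532)
        self.buf517.clock(wire_529, enable=clk_en, reset=reset)
        self.buf512.clock(data_521, enable=input_en)
        if reset:
            self.buf516.clock(0, enable=0, reset=1)

    def software_clock_cycle(self, data_521):
        self.gated_clock_edge(eval_asserted=1)           # edge detected in hardware
        self.system_tick(data_521, clk_en=1)             # CLK-EN: buffer 517 takes the 1
        self.system_tick(data_521)                       # register 518 clocks in old data
        self.system_tick(data_521, input_en=1, reset=1)  # INPUT-EN and RESET
        self.system_tick(data_521)                       # new data waits at wire 530
        return self.reg518.q


model = Fig19Model(initial_data=5)
first = model.software_clock_cycle(9)    # register 518 receives the old data
second = model.software_clock_cycle(12)  # the waiting data is clocked in next
print(first, second)
```

In this sketch the data value admitted by INPUT-EN during one cycle only reaches register 518 on the following detected clock edge, which mirrors the ordering in which the enable signal arrives before the new data.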

As discussed earlier, the software clock implementation relies primarily on the clock set-up
shown in FIG. 19 along with the timing of the assertions of the -EVAL, INPUT-EN, CLK-EN,
and RESET signals. FIG. 20 shows a four state finite state machine to control the software
clock logic of FIG. 19 in accordance with one embodiment of the present invention.

At state 540, the system is idle or some I/O operation is under way. The -EVAL signal
is logic "0." The -EVAL signal determines the evaluation cycle, is generated by the system
controller, and lasts as many clock cycles as needed to stabilize the logic in the system. Usually,
the duration of the -EVAL signal is determined by the placement scheme during compilation and
is based on the length of the longest direct wire and the length of the longest segmented
multiplexed wires (i.e., TDM circuits). During evaluation, the -EVAL signal is at logic "1."

At state 541, the clock is enabled. The CLK-EN signal is asserted at logic "1" and thus,
the enable signal to the hardware register model is asserted. Here, previously gated data at the
hardware register model is evaluated synchronously without risk of hold-time violation.

At state 542, the new data is enabled when INPUT-EN signal is asserted at logic "1." 
The RESET signal is also asserted to remove the enable signal from the hardware register model. 
However, the new data that had been enabled into the hardware register model through the gated 
data logic network continues to propagate to its intended hardware register model destination or 
has reached its destination and is waiting to be clocked into the hardware register model if and
when the enable signal is asserted again.

At state 543, the propagating new data is stabilizing in the logic while the ~EVAL signal
remains at logic "1." The muxed-wire, as discussed above for the time division multiplexed
(TDM) circuit in association with FIGS. 9(A), 9(B), and 9(C), is also at logic "1." When the
-EVAL signal is deasserted or set to logic "0," the system returns to the idle state 540 and waits
to evaluate upon the detection of a clock edge by the software.
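
The four states above can be summarized as a small state machine sketch. The signal levels per state follow the description above; the transition conditions (leaving state 540 on a detected clock edge, spending one step each in states 541 and 542, and holding state 543 for a programmable number of cycles) are assumptions made for illustration, as are the names `State`, `outputs`, `next_state`, and `run`.

```python
from enum import Enum


class State(Enum):
    IDLE = 540          # idle or I/O under way
    CLK_ENABLE = 541    # CLK-EN asserted
    INPUT_ENABLE = 542  # INPUT-EN and RESET asserted
    EVAL = 543          # EVAL asserted while new data stabilizes


def outputs(state):
    """Control signal levels in each state, per the description above."""
    return {
        "EVAL": int(state is State.EVAL),
        "CLK_EN": int(state is State.CLK_ENABLE),
        "INPUT_EN": int(state is State.INPUT_ENABLE),
        "RESET": int(state is State.INPUT_ENABLE),
    }


def next_state(state, clock_edge_detected=False, eval_cycles_left=0):
    """Assumed transition rule; the source does not spell out every condition."""
    if state is State.IDLE:
        return State.CLK_ENABLE if clock_edge_detected else State.IDLE
    if state is State.CLK_ENABLE:
        return State.INPUT_ENABLE
    if state is State.INPUT_ENABLE:
        return State.EVAL
    return State.EVAL if eval_cycles_left > 0 else State.IDLE


def run(eval_cycles):
    """Trace one pass through the machine after a clock edge is detected."""
    state, remaining, trace = State.IDLE, eval_cycles, []
    state = next_state(state, clock_edge_detected=True)
    while state is not State.IDLE:
        trace.append(state)
        if state is State.EVAL:
            remaining -= 1
        state = next_state(state, eval_cycles_left=remaining)
    return trace


trace = run(eval_cycles=3)
print([s.value for s in trace])
```

The `eval_cycles` parameter stands in for the duration of the evaluation signal, which the description says is fixed at compilation from the longest direct wire and the longest segmented multiplexed wire.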

D. FPGA ARRAY AND CONTROL

The SEmulator system initially compiles the user circuit design data into software and
hardware models based on a variety of controls including component type. During the hardware
compilation process, the system performs the mapping, placement, and routing process as
described above with respect to FIG. 6 to optimally partition, place, and interconnect the various
components that make up the user's circuit design. Using known programming tools, the
bitstream configuration files or Programmer Object Files (.pof) (or alternatively, raw binary files
(.rbf)) are referenced to reconfigure a hardware board containing a number of FPGA chips. Each
chip contains a portion of the hardware model corresponding to the user's circuit design.

In one embodiment, the SEmulator system uses a 4x4 array of FPGA chips, totaling 16
chips. Exemplary FPGA chips include the Xilinx XC4000 series family of FPGA logic devices and
the Altera FLEX 10K devices.

The Xilinx XC4000 series of FPGAs can be used, including the XC4000, XC4000A,
XC4000D, XC4000H, XC4000E, XC4000EX, XC4000L, and XC4000XL. Particular FPGAs
include the Xilinx XC4005H, XC4025, and Xilinx XC4028EX. The Xilinx XC4028EX FPGA
engines approach half a million gates in capacity on a single PCI board. Details of these Xilinx
FPGAs can be obtained in their data book, Xilinx, The Programmable Logic Data Book (9/96),
which is incorporated herein by reference. For Altera FPGAs, details can be found in their data
book, Altera, The 1996 Data Book (June 1996), which is incorporated herein by reference.

A brief general description of the XC4025 FPGA will be provided. Each array chip
consists of a 240-pin Xilinx chip. The array board populated with Xilinx XC4025 chips contains
approximately 440,000 configurable gates, and is capable of performing computationally intensive
tasks. The Xilinx XC4025 FPGA consists of 1024 configurable logic blocks (CLBs). Each CLB
can implement 32 bits of asynchronous SRAM, or a small amount of general Boolean logic, and
two strobed registers. On the periphery of the chip, unstrobed I/O registers are provided. An
alternative to the XC4025 is the XC4005H. This is a relatively low-cost version of the array board
with 120,000 configurable gates. The XC4005H devices have high-power 24 mA drive circuits, but
are missing the input/output flip-flops of the standard XC4000 series. Details of these and other
Xilinx FPGAs can be obtained through their publicly available data sheets, which are
incorporated herein by reference.

The functionality of Xilinx XC4000 series FPGAs can be customized by loading
configuration data into internal memory cells. The values stored in these memory cells determine
the logic functions and interconnections in the FPGA. The configuration data of these FPGAs can
be stored in on-chip memory and can be loaded from external memory. The FPGAs can either
read configuration data from an external serial or parallel PROM, or the configuration data can
be written into the FPGAs from an external device. These FPGAs can be reprogrammed an
unlimited number of times, especially where hardware is changed dynamically or where users
desire the hardware to be adapted to different applications.

Generally, the XC4000 series FPGAs have up to 1024 CLBs. Each CLB has two levels of
look-up tables, with two four-input look-up tables (or function generators F and G) providing
some of the inputs to a third three-input look-up table (or function generator H), and two
flip-flops or latches. The outputs of these look-up tables can be driven independent of these flip-flops
or latches. The CLB can implement the following combination of arbitrary Boolean functions:
(1) any function of four or five variables, (2) any function of four variables, any second function
of up to four unrelated variables, and any third function of up to three unrelated variables, (3)
one function of four variables and another function of six variables, (4) any two functions of four
variables, and (5) some functions of nine variables. Two D type flip-flops or latches are available
for registering CLB inputs or for storing look-up table outputs. These flip-flops can be used
independently from the look-up tables. DIN can be used as a direct input to either one of these
two flip-flops or latches and H1 can drive the other through the H function generator.

Each four-input function generator in the CLB (i.e., F and G) contains dedicated
arithmetic logic for the fast generation of carry and borrow signals, which can be configured to
implement a two-bit adder with carry-in and carry-out. These function generators can also be
implemented as read/write random access memory (RAM). The four-input wire lines would be
used as address lines for the RAM.
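
The dual use of a four-input function generator as either a look-up table or a small RAM can be sketched as follows. This is an illustrative model only; the class `FunctionGenerator4` and the helper `truth_table_of` are hypothetical names, and the bit ordering of the address is an assumption. The four inputs select one of 16 one-bit cells, so two such generators account for the 32 bits of RAM per CLB mentioned above.

```python
class FunctionGenerator4:
    """Four-input look-up table backed by 16 one-bit memory cells."""

    def __init__(self, truth_table=0):
        # Bit i of truth_table is the output for input address i.
        self.cells = [(truth_table >> i) & 1 for i in range(16)]

    @staticmethod
    def _address(a3, a2, a1, a0):
        # The four inputs act as address lines (bit order is an assumption).
        return (a3 << 3) | (a2 << 2) | (a1 << 1) | a0

    def read(self, a3, a2, a1, a0):
        return self.cells[self._address(a3, a2, a1, a0)]

    def write(self, a3, a2, a1, a0, bit):
        self.cells[self._address(a3, a2, a1, a0)] = bit & 1


def truth_table_of(func):
    """Build the 16-bit configuration word for any Boolean function of four inputs."""
    table = 0
    for addr in range(16):
        bits = [(addr >> i) & 1 for i in (3, 2, 1, 0)]
        table |= (func(*bits) & 1) << addr
    return table


# Used as logic: a four-input XOR.
xor4 = FunctionGenerator4(truth_table_of(lambda a, b, c, d: a ^ b ^ c ^ d))
print(xor4.read(1, 0, 1, 1))

# Used as RAM: write one bit, then read it back.
ram = FunctionGenerator4()
ram.write(0, 1, 0, 1, 1)
print(ram.read(0, 1, 0, 1))
```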

The Altera FLEX 10K chips are somewhat similar in concept. These chips are SRAM-based
programmable logic devices (PLDs) having multiple 32-bit buses. In particular, each
FLEX 10K100 chip contains approximately 100,000 gates, 12 embedded array blocks (EABs),
624 logic array blocks (LABs), 8 logic elements (LEs) per LAB (or 4,992 LEs), 5,392 flip-flops
or registers, 406 I/O pins, and 503 total pins.

The Altera FLEX 10K chips contain an embedded array of embedded array blocks (EABs)
and a logic array of logic array blocks (LABs). An EAB can be used to implement various
memory (e.g., RAM, ROM, FIFO) and complex logic functions (e.g., digital signal processors
(DSPs), microcontrollers, multipliers, data transformation functions, state machines). As a
memory function implementation, the EAB provides 2,048 bits. As a logic function
implementation, the EAB provides 100 to 600 gates.

A LAB, via the LEs, can be used to implement medium sized blocks of logic. Each LAB
represents approximately 96 logic gates and contains 8 LEs and a local interconnect. An LE
contains a 4-input look-up table, a programmable flip-flop, and dedicated signal paths for carry
and cascade functions. Typical logic functions that can be created include counters, address
decoders, or small state machines.

More detailed descriptions of the Altera FLEX 10K chips can be found in Altera, 1996
DATA BOOK (June 1996), which is incorporated herein by reference. The data book also
contains details on the supporting programming software.

FIG. 8 shows one embodiment of the 4x4 FPGA array and their interconnections.
Note that this embodiment of the SEmulator does not use cross bar or partial cross bar
connections for the FPGA chips. The FPGA chips include chips F11 to F14 in the first row,
chips F21 to F24 in the second row, chips F31 to F34 in the third row, and chips F41 to F44 in
the fourth row. In one embodiment, each FPGA chip (e.g., chip F23) has the following pins for
the interface to the FPGA I/O controller of the SEmulator system:



Interface                 Pins
Data Bus                  32
SPACE index               3
READ, WRITE, EVAL         3
DATA XSFR                 1
Address pointer chain     2
TOTAL                     41



Thus, in one embodiment, each FPGA chip uses only 41 pins for interfacing with the SEmulator
system. These pins will be discussed further with respect to FIG. 22.

These FPGA chips are interconnected to each other via non-crossbar or non-partial
crossbar interconnections. Each interconnection between chips, such as interconnection 602
between chip F11 and chip F14, represents 44 pins or 44 wire lines. In other embodiments, each
interconnection represents more than 44 pins. Still in other embodiments, each interconnection
represents less than 44 pins.

Each chip has six interconnections. For example, chip F11 has interconnections 600 to
605. Also, chip F33 has interconnections 606 to 611. These interconnections run horizontally
along a row and vertically along a column. Each interconnection provides a direct connection
between two chips along a row or between two chips along a column. Thus, for example,
interconnection 600 directly connects chip F11 and F13; interconnection 601 directly connects
chip F11 and F12; interconnection 602 directly connects chip F11 and F14; interconnection 603
directly connects chip F11 and F31; interconnection 604 directly connects chip F11 and F21; and
interconnection 605 directly connects chip F11 and F41.

Similarly, for a chip F33 that is not located on the edge of the array (unlike, for example, chip F11),
interconnection 606 directly connects chip F33 and F13; interconnection 607 directly connects
chip F33 and F23; interconnection 608 directly connects chip F33 and F34; interconnection 609
directly connects chip F33 and F43; interconnection 610 directly connects chip F33 and F31; and
interconnection 611 directly connects chip F33 and F32.

Because chip F11 is located within one hop from chip F13, interconnection 600 is labeled
as "1." Because chip F11 is located within one hop from chip F12, interconnection 601 is
labeled as "1." Similarly, because chip F11 is located within one hop from chip F14,
interconnection 602 is labeled as "1." Similarly, for chip F33, all interconnections are labeled as
"1."

This interconnect scheme allows each chip to communicate with any other chip in the
array within two "jumps" or interconnections. Thus, chip F11 is connected to chip F33 through
either of the following two paths: (1) interconnection 600 to interconnection 606; or (2)
interconnection 603 to interconnection 610. In short, the path can be either: (1) along a row first
and then along a column, or (2) along a column first and then along a row.
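
The two-jump property can be checked with a short sketch. It assumes only what the text states, namely that every chip is directly connected to each other chip in its row and in its column; the (row, column) tuple representation and the function names `directly_connected` and `routes` are hypothetical.

```python
from itertools import product

ROWS = COLS = 4
CHIPS = [(r, c) for r, c in product(range(1, ROWS + 1), range(1, COLS + 1))]


def name(chip):
    return f"F{chip[0]}{chip[1]}"


def directly_connected(a, b):
    """True if two distinct chips share a row or a column."""
    return a != b and (a[0] == b[0] or a[1] == b[1])


def routes(src, dst):
    """Shortest routes from src to dst, using one or two interconnections."""
    if src == dst:
        return [[src]]
    if directly_connected(src, dst):
        return [[src, dst]]
    row_first = (src[0], dst[1])  # along the row first, then along the column
    col_first = (dst[0], src[1])  # along the column first, then along the row
    return [[src, row_first, dst], [src, col_first, dst]]


for path in routes((1, 1), (3, 3)):
    print(" -> ".join(name(chip) for chip in path))
```

For chips F11 and F33 the sketch yields the two paths named above: through F13 (interconnections 600 and 606) and through F31 (interconnections 603 and 610).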

Although FIG. 8 shows the FPGA chips configured in a 4x4 array with horizontal and
vertical interconnections, the actual physical implementation on a board is through low and high
banks with an expansion piggyback board. So, in one embodiment, chips F41-F44 and chips
F21-F24 are in the low bank. Chips F31-F34 and chips F11-F14 are in the high bank. The
piggyback board contains chips F11-F14 and chips F21-F24. Thus, to expand the array,
piggyback boards containing a number (e.g., 8) of chips are added to the banks and hence, above
the row currently containing chips F11-F14. In other embodiments, the piggyback board will
expand the array below the row currently containing chips F41-F44. Further embodiments allow
expansion to the right of chips F14, F24, F34, and F44. Still other embodiments allow
expansion to the left of chips F11, F21, F31, and F41.

Represented in terms of "1" or "0," FIG. 7 shows a connectivity matrix for the 4x4
FPGA array of FIG. 8. This connectivity matrix is used to generate a placement cost result from
a cost function used in the hardware mapping, placement, and routing process for this SEmulation
system. The cost function was discussed above with respect to FIG. 6. As an example, chip F11
is located within one hop from chip F13, so the connectivity matrix entry for F11-F13 is "1."
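
A connectivity matrix of this kind can be generated from the row and column rule of FIG. 8, as in the following sketch. The actual contents and ordering of FIG. 7 are not reproduced here; the row-major chip ordering, the "0" entries on the diagonal, and the function name `connectivity_matrix` are assumptions made for illustration.

```python
def connectivity_matrix(rows=4, cols=4):
    """Return chip labels and a 0/1 matrix; 1 means the chips are one hop apart."""
    chips = [(r, c) for r in range(1, rows + 1) for c in range(1, cols + 1)]
    labels = [f"F{r}{c}" for r, c in chips]
    matrix = [
        # Two distinct chips are one hop apart when they share a row or a column.
        [int(a != b and (a[0] == b[0] or a[1] == b[1])) for b in chips]
        for a in chips
    ]
    return labels, matrix


labels, matrix = connectivity_matrix()
f11, f13, f33 = labels.index("F11"), labels.index("F13"), labels.index("F33")
print(matrix[f11][f13], matrix[f11][f33])
```

Under these assumptions the F11-F13 entry is "1," as in the example above, while the F11-F33 entry is "0" because those two chips need two interconnections.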

FIG. 21 shows the interconnect pin-outs for a single FPGA chip in accordance with one
embodiment of the present invention. Each chip has six sets of interconnections, where each set
comprises a particular number of pins. In one embodiment, each set has 44 pins. The
interconnections for each FPGA chip are oriented horizontally (East-West) and vertically
(North-South). The set of interconnections for the West direction is labeled as W[43:0]. The set of
interconnections for the East direction is labeled as E[43:0]. The set of interconnections for the
North direction is labeled as N[43:0]. The set of interconnections for the South direction is
labeled as S[43:0]. These complete sets of interconnections are for the connections to adjacent
chips; that is, these interconnections do not "hop" over any chip. For example, in FIG. 8, chip
F33 has interconnection 607 for N[43:0], interconnection 608 for E[43:0], interconnection 609
for S[43:0], and interconnection 611 for W[43:0].

Returning to FIG. 21, two additional sets of interconnections are remaining. One set of
interconnections is for the non-adjacent interconnections running vertically - YH[21:0] and
YH[43:22]. The other set of interconnections is for the non-adjacent interconnections running
horizontally - XH[21:0] and XH[43:22]. Each set, YH[...] and XH[...], is divided into two,
where each half of a set contains 22 pins. This configuration allows each chip to be
manufactured identically. Thus, each chip is capable of being interconnected in one hop to a
non-adjacent chip located above, below, left, and right. This FPGA chip also shows the pin(s) for
global signals, the FPGA bus, and JTAG signals.

The FPGA I/O controller will now be discussed. This controller was first briefly
introduced in FIG. 10 as item 327. The FPGA I/O controller manages the data and control traffic
between the PCI bus and the FPGA array.

FIG. 22 shows one embodiment of the FPGA controller between the PCI bus and the
FPGA array, along with the banks of FPGA chips. The FPGA I/O controller 700 includes
CTRL_FPGA unit 701, clock buffer 702, PCI controller 703, EEPROM 704, FPGA serial
configuration interface 705, boundary scan test interface 706, and buffer 707. Appropriate
power/voltage regulating circuitry as known to those skilled in the art is provided. Exemplary
sources include Vcc coupled to a voltage detector/regulator and a sense amplifier to substantially
maintain the voltage in various environmental conditions. The Vcc to each FPGA chip is
provided with fast acting thin-film fuses therebetween. The Vcc-HI is provided to the CONFIG#
to all FPGA chips and LINTI# to a LOCAL_BUS 708.

The CTRL_FPGA unit 701 is the primary controller for FPGA I/O controller 700 to
handle the various control, test, and read/write substantive data among the various units and
buses. CTRL_FPGA unit 701 is coupled to the low and high banks of FPGA chips. FPGA chips
F41-F44 and F21-F24 (i.e., low bank) are coupled to low FPGA bus 718. FPGA chips F31-F34
and F11-F14 (i.e., high bank) are coupled to high FPGA bus 719. These FPGA chips F11-F14,
F21-F24, F31-F34, and F41-F44 correspond to the FPGA chips in FIG. 8, retaining their
reference numbers.

Between these FPGA chips F11-F14, F21-F24, F31-F34, and F41-F44 and the low bank
bus 718 and high bank bus 719 are thick film chip resistors for appropriate loading purposes. The
group of resistors 713 coupled to the low bank bus 718 includes, for example, resistor 716 and
resistor 717. The group of resistors 712 coupled to the high bank bus 719 includes, for
example, resistor 714 and resistor 715.

If expansion is desired, more FPGA chips may be installed on the low bank bus 718 and
the high bank bus 719 in the direction to the right of FPGA chips F11 and F21. In one
embodiment, expansion is done through piggyback boards resembling piggyback board 720.
Thus, if these banks of FPGA chips initially had only eight FPGA chips F41-F44 and F31-F34,
further expansion is possible by adding piggyback board 720, which contains FPGA chips F24-
F21 in the low bank and chips F14-F11 in the high bank. The piggyback board 720 also includes
the additional low and high bank bus, and the thick film chip resistors.

The PCI controller 703 is the primary interface between the FPGA I/O controller 700 and
the 32-bit PCI bus 709. If the PCI bus expands to 64 bits and/or 66 MHz, appropriate
adjustments can be made in this system without departing from the spirit and scope of the present
invention. These adjustments will be discussed below. One example of a PCI controller 703 that
may be used in the system is PLX Technology's PCI 9080 or 9060. The PCI 9080 has the
appropriate local bus interface, control registers, FIFOs, and PCI interface to the PCI bus. The
data book PLX Technology, PCI 9080 Data Sheet (ver. 0.93, Feb. 28, 1997) is incorporated
herein by reference.

The PCI controller 703 passes data between the CTRL_FPGA unit 701 and the PCI bus
709 via a LOCAL_BUS 708. The LOCAL_BUS includes a control bus portion, an address bus portion,
and a data bus portion for control signals, address signals, and data signals, respectively. If the
PCI bus expands to 64 bits, the data bus portion of LOCAL_BUS 708 can also expand to 64 bits.
The PCI controller 703 is coupled to EEPROM 704, which contains the configuration data for the
PCI controller 703. An exemplary EEPROM 704 is National Semiconductor's 93CS46.

The PCI bus 709 supplies a clock signal at 33 MHz to the FPGA I/O controller 700. The
clock signal is provided to clock buffer 702 via wire line 710 for synchronization purposes and
for low timing skew. The output of this clock buffer 702 is the global clock (GL_CLK) signal at
33 MHz supplied to all the FPGA chips via wire line 711 and to the CTRL_FPGA unit 701 via
wire line 721. If the PCI bus expands to 66 MHz, the clock buffer will also supply 66 MHz to
the system.

FPGA serial configuration interface 705 provides configuration data to configure the
FPGA chips F11-F14, F21-F24, F31-F34, and F41-F44. The Altera data book, Altera, 1996
DATA BOOK (June 1996), provides detailed information on the configuration devices and
processes. FPGA serial configuration interface 705 is also coupled to LOCAL_BUS 708 and the
parallel port 721. Furthermore, the FPGA serial configuration interface 705 is coupled to
CTRL_FPGA unit 701 and the FPGA chips F11-F14, F21-F24, F31-F34, and F41-F44 via
CONF_INTF wire line 723.

The boundary scan test interface 706 provides JTAG implementations of a certain specified
test command set to externally check a processor's or system's logic units and circuits by
software. This interface 706 complies with the IEEE Std. 1149.1-1990 specification. Refer to
the Altera data book, Altera, 1996 DATA BOOK (June 1996) and Application Note 39 (JTAG
Boundary-Scan Testing in Altera Devices), both of which are incorporated herein by reference,
for more information. Boundary scan test interface 706 is also coupled to LOCAL_BUS 708 and
the parallel port 722. Furthermore, the boundary scan test interface 706 is coupled to
CTRL_FPGA unit 701 and the FPGA chips F11-F14, F21-F24, F31-F34, and F41-F44 via
BST_INTF wire line 724.

CTRL_FPGA unit 701 passes data to/from the low (chips F41-F44 and F21-F24) and high
(chips F31-F34 and F11-F14) banks of FPGA chips via low bank 32-bit bus 718 and high bank
32-bit bus 719, respectively, along with buffer 707, and F_BUS 725 for the low bank 32 bits
FD[31:0] and F_BUS 726 for the high bank 32 bits FD[63:32].

One embodiment duplicates the throughput of the PCI bus 709 in the low bank bus 718
and the high bank bus 719. The PCI bus 709 is 32 bits wide at 33 MHz. The throughput is thus
132 MB/s (= 33 MHz * 4 Bytes). The low bank bus 718 is 32 bits wide at half the PCI bus
frequency (33/2 MHz = 16.5 MHz). The high bank bus 719 is also 32 bits wide at half the PCI
bus frequency (33/2 MHz = 16.5 MHz). The throughput of the 64-bit low and high bank buses is
also 132 MB/s (= 16.5 MHz * 8 Bytes). Thus, the performance of the low and high bank buses
tracks the performance of the PCI bus. In other words, the performance limitations are in the
PCI bus, not in the low and high bank buses.
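
The throughput arithmetic above can be checked with a few lines of Python. This is only an
illustrative sketch; the function and variable names are hypothetical and are not part of the
described hardware.

```python
def peak_throughput_mb_s(clock_mhz: float, width_bits: int) -> float:
    """Peak throughput in MB/s: clock rate (MHz) times bus width in bytes."""
    return clock_mhz * (width_bits / 8)


PCI_CLOCK_MHZ = 33.0

# PCI bus 709: 32 bits wide at 33 MHz.
pci_bus = peak_throughput_mb_s(PCI_CLOCK_MHZ, 32)

# Low bank bus 718 and high bank bus 719 together form a 64-bit path
# running at half the PCI clock.
bank_buses = peak_throughput_mb_s(PCI_CLOCK_MHZ / 2, 32 + 32)

print(pci_bus, bank_buses)
```

Both values come out to 132 MB/s, which is why the bank buses are not the limiting factor.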

Address pointers, in accordance with one embodiment of the present invention, are also
implemented in each FPGA chip for each software/hardware boundary address space. These
address pointers are chained across several FPGA chips through the multiplexed cross chip
address pointer chain. Please refer to the address pointer discussion above with respect to FIGS.
9, 11, 12, 14, and 15. To move the word selection signal across the chain of address pointers
associated with a given address space and across several chips, chain-out wire lines must be
provided. These chain-out wire lines are shown as the arrows between the chips. One such
chain-out wire line for the low bank is wire line 730 between chips F23 and F22. Another such
chain-out wire line for the high bank is wire line 731 between chips F31 and F32. The chain-out
wire line 732 at the end of low bank chip F21 is coupled to the CTRL_FPGA unit 701 as
LAST_SHIFT_L. The chain-out wire line 733 at the end of high bank chip F11 is coupled to the
CTRL_FPGA unit 701 as LAST_SHIFT_H. These signals LAST_SHIFT_L and
LAST_SHIFT_H are the word selection signals for their respective banks as the word selection
signals are propagated through the FPGA chips. When either of these signals LAST_SHIFT_L
and LAST_SHIFT_H presents a logic "1" to the CTRL_FPGA unit 701, this indicates that the
word selection signal has made its way to the end of its respective bank of chips.
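
The propagation of the word selection signal through one bank can be pictured with a small
behavioral model. The sketch below is hypothetical: the chip ordering within the low bank, the
number of address pointers per chip, and the function name are all assumptions made for
illustration. Only the end-of-chain behavior (LAST_SHIFT_L going to logic "1" once the signal
leaves chip F21) follows the description above.

```python
from typing import List, Optional, Tuple


def propagate_word_select(chain: List[Tuple[str, int]],
                          moves: int) -> Tuple[Optional[str], bool]:
    """Return (chip holding the word selection signal, LAST_SHIFT flag).

    chain is an ordered list of (chip name, address pointer count).
    The signal starts at the first pointer of the first chip and advances
    by one pointer per MOVE pulse.
    """
    position = moves
    for chip, pointer_count in chain:
        if position < pointer_count:
            return chip, False
        position -= pointer_count
    # The signal has left the last chip in the bank: LAST_SHIFT = 1.
    return None, True


# Assumed low bank ordering and pointer counts (illustrative only).
low_bank = [("F41", 2), ("F42", 1), ("F43", 3), ("F44", 1),
            ("F24", 2), ("F23", 1), ("F22", 2), ("F21", 1)]

print(propagate_word_select(low_bank, 0))
print(propagate_word_select(low_bank, 13))
```

With these assumed counts the bank holds 13 pointers, so the thirteenth MOVE pulse pushes the
signal out of F21 and raises the flag.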

The CTRL_FPGA unit 701 provides a write signal (F_WR) on wire line 734, a read
signal (F_RD) on wire line 735, a DATA_XSFR signal on wire line 736, an ~EVAL signal on
wire line 737, and a SPACE[2:0] signal on wire line 738 to and from the FPGA chips. The
CTRL_FPGA unit 701 receives the EVAL_REQ# signal on wire line 739. The write signal
(F_WR), read signal (F_RD), DATA_XSFR signal, and SPACE[2:0] signal work together for
the address pointers in the FPGA chips. The write signal (F_WR), read signal (F_RD), and
SPACE[2:0] signal are used to generate the MOVE signal for the address pointers associated with
the selected address space as determined by the SPACE index (SPACE[2:0]). The DATA_XSFR
signal is used to initialize the address pointers and begin the word-by-word data transfer process.

The EVAL_REQ# signal is used to start the evaluation cycle all over again if any of the
FPGA chips asserts this signal. For example, to evaluate data, data is transferred or written from
main memory in the host processor's computing station to the FPGAs via the PCI bus. At the
end of the transfer, the evaluation cycle begins, including address pointer initialization and the
operation of the software clocks to facilitate the evaluation process. However, for a variety of
reasons, a particular FPGA chip may need to evaluate the data all over again. This FPGA chip
asserts the EVAL_REQ# signal and the CTRL_FPGA unit 701 starts the evaluation cycle all
over again.
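
The restart behavior described above can be sketched as a simple loop. This is a behavioral
illustration only, not the control logic of the CTRL_FPGA unit: each chip is modeled as a
hypothetical callable that reports whether it asserts EVAL_REQ# after a given evaluation cycle,
and the `max_cycles` guard is an assumption added so the sketch always terminates.

```python
from typing import Callable, Sequence


def run_evaluation(chips: Sequence[Callable[[int], bool]],
                   max_cycles: int = 8) -> int:
    """Repeat the evaluation cycle while any chip asserts EVAL_REQ#.

    Each chip is called with the 1-based cycle number and returns True
    if it asserts EVAL_REQ# after that cycle. Returns the number of
    evaluation cycles that were run.
    """
    for cycle in range(1, max_cycles + 1):
        if not any(chip(cycle) for chip in chips):
            return cycle
    raise RuntimeError("evaluation did not settle within max_cycles")


# One chip never asks for re-evaluation; the other asks after cycles 1 and 2.
chips = [lambda cycle: False, lambda cycle: cycle < 3]
cycles_run = run_evaluation(chips)
print(cycles_run)
```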

FIG. 23 shows a more detailed illustration of the CTRL_FPGA unit 701 and buffer 707 of
FIG. 22. The same input/output signals and their corresponding reference numbers for
CTRL_FPGA unit 701 shown in FIG. 22 are also retained and used in FIG. 23. However,
additional signals and wire/bus lines not shown in FIG. 22 will be described with new reference
numbers, such as SEM_FPGA output enable 1016, local interrupt output (Local INTO) 708a,
local read/write control signals 708b, local address bus 708c, local interrupt input (Local INTI#)
708d, and local data bus 708e.

CTRL_FPGA unit 701 contains a Transfer Done Checking Logic (XSFR_DONE Logic)
1000, Evaluation Control Logic (EVAL Logic) 1001, DMA Descriptor Block 1002, Control
Register 1003, Evaluation Timer Logic (EVAL timer) 1004, Address Decoder 1005, Write Flag
Sequencer Logic 1006, FPGA Chip Read/Write Control Logic (SEM_FPGA R/W Logic) 1007,
Demultiplexer and Latch (DEMUX logic) 1008, and latches 1009-1012, which correspond to
buffer 707 in FIG. 22. A global clock signal (CTRL_FPGA_CLK) on wire/bus 721 is provided
to all logic elements/blocks in CTRL_FPGA unit 701.

The Transfer Done Checking Logic (XSFR_DONE) 1000 receives LAST_SHIFT_H 733,
LAST_SHIFT_L 732 and local INTO 708a. XSFR_DONE logic 1000 outputs a transfer done
signal (XSFR_DONE) on wire/bus 1013 to EVAL Logic 1001. Based on the reception of
LAST_SHIFT_H 733 and LAST_SHIFT_L 732, the XSFR_DONE logic 1000 checks for the
completion of the data transfer so that the evaluation cycle can begin, if desired.

The EVAL Logic 1001 receives the EVAL_REQ# signal on wire/bus 739 and the
WR_XSFR/RD_XSFR signal on wire/bus 1015, in addition to the transfer done signal
(XSFR_DONE) on wire/bus 1013. EVAL Logic 1001 generates two output signals: Start EVAL
on wire/bus 1014 and DATA_XSFR on wire/bus 736. The EVAL logic indicates when data
transfer between the FPGA bus and the PCI bus will begin to initialize the address pointers. It
receives the XSFR_DONE signal when the data transfer is complete. The WR_XSFR/RD_XSFR
signal indicates whether the transfer is a read or a write. Once the I/O cycle is complete (or
before the onset of an I/O cycle), the EVAL logic can start the evaluation cycle with the Start
EVAL signal to the EVAL timer. The EVAL timer dictates the duration of the evaluation cycle
and ensures the successful operation of the software clock mechanism by keeping the evaluation
cycle active for as long as necessary to stabilize the data propagation to all the registers and
combinational components.

DMA descriptor block 1002 receives the local bus address on wire/bus 1019, a write
enable signal on wire/bus 1020 from address decoder 1005, and local bus data on wire/bus 1029
via local data bus 708e. The output is DMA descriptor output on wire/bus 1046 to DEMUX
logic 1008 on wire/bus 1045. The DMA descriptor block 1002 contains the descriptor block
information corresponding to that in the host memory, including PCI address, local address,
transfer count, transfer direction, and address of the next descriptor block. The host will also set
up the address of the initial descriptor block in the descriptor pointer register of the PCI
controller. Transfers can be initiated by setting a control bit. The PCI controller loads the first
descriptor block and initiates the data transfer. The PCI controller continues to load descriptor
blocks and transfer data until it detects that the end-of-chain bit is set in the next descriptor
pointer register.
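
The chained-descriptor mechanism can be sketched as a walk over a linked list of descriptor
blocks. The sketch below is a minimal software model under stated assumptions: the field names,
the dictionary standing in for host memory, and the example addresses are all hypothetical, and
no actual data movement is modeled.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Descriptor:
    pci_address: int
    local_address: int
    transfer_count: int
    to_local_bus: bool      # transfer direction (naming is an assumption)
    next_descriptor: int    # address of the next descriptor block
    end_of_chain: bool      # end-of-chain bit


def walk_chain(memory: Dict[int, Descriptor], first: int) -> List[Descriptor]:
    """Load descriptor blocks in order until the end-of-chain bit is set."""
    processed = []
    address = first
    while True:
        descriptor = memory[address]
        processed.append(descriptor)  # a real controller would transfer data here
        if descriptor.end_of_chain:
            return processed
        address = descriptor.next_descriptor


# Hypothetical three-block chain set up by the host.
host_memory = {
    0x100: Descriptor(0x8000, 0x00, 16, True, 0x200, False),
    0x200: Descriptor(0x8040, 0x10, 32, True, 0x300, False),
    0x300: Descriptor(0x80C0, 0x30, 8, False, 0x000, True),
}
completed = walk_chain(host_memory, first=0x100)
print(len(completed), sum(d.transfer_count for d in completed))
```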

Address decoder 1005 receives and transmits local R/W control signals on bus 708b, and
receives and transmits local address signals on bus 708c. The address decoder 1005 generates a
write enable signal on wire/bus 1020 to the DMA descriptor 1002, a write enable signal on
wire/bus 1021 to control register 1003, the FPGA address SPACE index on wire/bus 738, a
control signal on wire/bus 1027, and another control signal on wire/bus 1024 to DEMUX logic
1008.

Control register 1003 receives the write enable signal on wire/bus 1021 from address
decoder 1005, and data from wire/bus 1030 via local data bus 708e. The control register 1003
generates a WR_XSFR/RD_XSFR signal on wire/bus 1015 to EVAL logic 1001, a Set EVAL
time signal on wire/bus 1041 to EVAL timer 1004, and a SEM_FPGA output enable signal on
wire/bus 1016 to the FPGA chips. The system uses the SEM_FPGA output enable signal to turn
on or enable each FPGA chip selectively. Typically, the system enables each FPGA chip one at
a time.

EVAL timer 1004 receives the Start EVAL signal on wire/bus 1014, and the Set EVAL
time on wire/bus 1041. EVAL timer 1004 generates the ~EVAL signal on wire/bus 737, an
evaluation done (EVAL_DONE) signal on wire/bus 1017, and a Start write flag signal on
wire/bus 1018 to the Write Flag Sequencer logic 1006. In one embodiment, the EVAL timer is 6
bits long.

The Write Flag Sequencer logic 1006 receives the Start write flag signal on wire/bus 1018
from EVAL timer 1004. The Write Flag Sequencer logic 1006 generates a local R/W control
signal on wire/bus 1022 to local R/W wire/bus 708b, a local address signal on wire/bus 1023 to
local address bus 708c, a local data signal on wire/bus 1028 to local data bus 708e, and local
INTI# on wire/bus 708d. Upon receiving the start write flag signal, the write flag sequencer
logic begins the sequence of control signals to begin the memory write cycles to the PCI bus.

The SEM_FPGA R/W Control logic 1007 receives control signals on wire/bus 1027 from
the address decoder 1005, and a local R/W control signal on wire/bus 1047 via local R/W control
bus 708b. The SEM_FPGA R/W Control logic 1007 generates an enable signal on wire/bus 1035
to latch 1009, a control signal on wire/bus 1025 to the DEMUX logic 1008, an enable signal on
wire/bus 1037 to latch 1011, an enable signal on wire/bus 1040 to latch 1012, an F_WR signal on
wire/bus 734, and an F_RD signal on wire/bus 735. The SEM_FPGA R/W Control logic 1007
controls the various write and read data transfers to/from the FPGA low bank and high bank
buses.

The DEMUX logic 1008 is a multiplexer and a latch which receives four sets of input
signals and outputs one set of signals on wire/bus 1026 to the local data bus 708e. The selector
signals are the control signal on wire/bus 1025 from SEM_FPGA R/W control logic 1007 and the
control signal on wire/bus 1024 from address decoder 1005. The DEMUX logic 1008 receives
one set of inputs from the EVAL_DONE signal on wire/bus 1042, the XSFR_DONE signal on
wire/bus 1043, and the ~EVAL signal on wire/bus 1044. This single set of signals is labeled as
reference number 1048. At any one time period, only one of these three signals, EVAL_DONE,
XSFR_DONE, and ~EVAL, will be provided to DEMUX logic 1008 for possible selection. The
DEMUX logic 1008 also receives, as the other three sets of input signals, the DMA descriptor
output signal on wire/bus 1045 from the DMA descriptor block 1002, a data output on wire/bus
1039 from latch 1012, and another data output on wire/bus 1034 from latch 1010.

The data buffer between the CTRL_FPGA unit 701 and the low and high FPGA bank buses
comprises latches 1009 to 1012. Latch 1009 receives local bus data on wire/bus 1032 via
wire/bus 1031 and local data bus 708e, and an enable signal on wire/bus 1035 from SEM_FPGA
R/W Control logic 1007. Latch 1009 outputs data on wire/bus 1033 to latch 1010.

Latch 1010 receives data on wire/bus 1033 from latch 1009, and an enable signal on
wire/bus 1036 via wire/bus 1037 from SEM_FPGA R/W Control logic 1007. Latch 1010
outputs data on wire/bus 725 to the FPGA low bank bus and the DEMUX logic 1008 via
wire/bus 1034.

Latch 1011 receives data on wire/bus 1031 from local data bus 708e, and an enable signal
on wire/bus 1037 from SEM_FPGA R/W Control logic 1007. Latch 1011 outputs data on
wire/bus 726 to the FPGA high bank bus and on wire/bus 1038 to latch 1012.

Latch 1012 receives data on wire/bus 1038 from latch 1011, and an enable signal on
wire/bus 1040 from SEM_FPGA R/W Control logic 1007. Latch 1012 outputs data on wire/bus
1039 to DEMUX 1008.

FIG. 24 shows the 4x4 FPGA array, its relationship to the FPGA banks, and the
expansion capability. Like FIG. 8, FIG. 24 shows the same 4x4 array. The CTRL_FPGA unit
740 is also shown. Low bank chips (chips F41-F44 and F21-F24) and high bank chips (chips
F31-F34 and F11-F14) are arranged in an alternating manner. Thus, from the bottom row to the
top row, the rows of FPGA chips are: low bank, high bank, low bank, high bank. The
data transfer chain follows the banks in a predetermined order. The data transfer chain for the
low bank is shown by arrow 741. The data transfer chain for the high bank is shown by arrow
742. The JTAG configuration chain is shown by arrow 743, which runs through the entire array
of 16 chips from F41 to F44, F34 to F31, F21 to F24, and F14 to F11, and back to the
CTRL_FPGA unit 740.

Expansion can be accomplished with piggyback boards. Assuming in FIG. 24 that the
original array of FPGA chips included F41-F44 and F31-F34, the addition of two more rows of
chips F21-F24 and F11-F14 can be accomplished with piggyback board 745. The piggyback
board 745 also includes the appropriate buses to extend the banks. Further expansion can be
accomplished with more piggyback boards placed one on top of the other in the array.

FIG. 25 shows one embodiment of the hardware start-up method. Step 800 initiates the
power on or warm boot sequence. In step 801, the PCI controller reads the EEPROM for
initialization. Step 802 reads and writes PCI controller registers in light of the initialization
sequence. Step 803 performs boundary scan tests on all the FPGA chips in the array. Step 804
configures the CTRL_FPGA unit in the FPGA I/O controller. Step 805 reads and writes the
registers in the CTRL_FPGA unit. Step 806 sets up the PCI controller for DMA master
read/write modes. Thereafter, the data is transferred and verified. Step 807 configures all the
FPGA chips with a test design and verifies its correctness. At step 808, the hardware is ready for
use. At this point, the system assumes all the steps resulted in a positive confirmation of the
operability of the hardware; otherwise, the system would never reach step 808.
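
The start-up method of FIG. 25 amounts to an ordered list of checks in which any failure
prevents the system from reaching step 808. A minimal sketch follows, assuming a hypothetical
`step_passes` callback that reports whether a given step succeeded. The step descriptions
paraphrase the text above, and raising an exception on failure is a choice of this sketch, not
something the description specifies.

```python
from typing import Callable

STARTUP_STEPS = [
    (800, "initiate power on or warm boot sequence"),
    (801, "PCI controller reads the EEPROM for initialization"),
    (802, "read and write PCI controller registers"),
    (803, "boundary scan test all FPGA chips in the array"),
    (804, "configure the CTRL_FPGA unit"),
    (805, "read and write the CTRL_FPGA registers"),
    (806, "set up DMA master read/write modes, transfer and verify data"),
    (807, "configure all FPGA chips with a test design and verify it"),
]
READY_STEP = 808


def run_startup(step_passes: Callable[[int], bool]) -> int:
    """Run the steps in order; return 808 only if every step passes."""
    for number, description in STARTUP_STEPS:
        if not step_passes(number):
            raise RuntimeError(f"step {number} failed: {description}")
    return READY_STEP


print(run_startup(lambda step: True))
```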
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E. ALTERNATE EMBODIMENT USING DENSER FPGA CHIPS 
In one embodiment of the present invention, the FPGA logic devices are provided on
individual boards. If more FPGA logic devices are required to model the user's circuit design
than are provided on the board, multiple boards with more FPGA logic devices can be provided.
The ability to add more boards into the Simulation system is a desirable feature of the present
invention. In this embodiment, denser FPGA chips, such as Altera 10K130V and 10K250V, are
used. Use of these chips alters the board design such that only four FPGA chips, instead of eight
less dense FPGA chips (e.g., Altera 10K100), are used per board.
The coupling of these boards to the motherboard of the Simulation system presents a
challenge. The interconnection and connection schemes must compensate for the lack of a
backplane. The FPGA array in the Simulation system is provided on the motherboard through a
particular board interconnect structure. Each chip may have up to eight sets of interconnections,
where the interconnections are arranged according to adjacent direct-neighbor interconnects (i.e.,
N[73:0], S[73:0], W[73:0], E[73:0]), and one-hop neighbor interconnects (i.e., NH[27:0],
SH[27:0], XH[36:0], XH[72:37]), excluding the local bus connections, within a single board and
across different boards. Each chip is capable of being interconnected directly to adjacent
neighbor chips, or in one hop to a non-adjacent chip located above, below, left, and right. In the
X direction (east-west), the array is a torus. In the Y direction (north-south), the array is a mesh.

The interconnects alone can couple logic devices and other components within a single
board. However, inter-board connectors are provided to couple these boards and interconnects
together across different boards to carry signals between (1) the PCI bus via the motherboard and
the array boards, and (2) any two array boards. Each board contains its own FPGA bus
FD[63:0] that allows the FPGA logic devices to communicate with each other, the SRAM
memory devices, and the CTRL_FPGA unit (FPGA I/O controller). The FPGA bus FD[63:0] is
not provided across the multiple boards. The FPGA interconnects, however, provide
connectivity among the FPGA logic devices across multiple boards although these interconnects
are not related to the FPGA bus. On the other hand, the local bus is provided across all the
boards.

A motherboard connector connects the board to the motherboard, and hence, to the PCI
bus, power, and ground. For some boards, the motherboard connector is not used for direct
connection to the motherboard. In a six-board configuration, only boards 1, 3, and 5 are directly
connected to the motherboard while the remaining boards 2, 4, and 6 rely on their neighbor
boards for motherboard connectivity. Thus, every other board is directly connected to the
motherboard, and interconnects and local buses of these boards are coupled together via
inter-board connectors arranged solder-side to component-side. PCI signals are routed through
one of the boards (typically the first board) only. Power and ground are applied to the other
motherboard connectors for those boards. Placed solder-side to component-side, the various
inter-board connectors allow communication among the PCI bus components, the FPGA logic
devices, memory devices, and various Simulation system control circuits.

FIG. 56 shows a high level block diagram of the array of FPGA chip configuration in
accordance with one embodiment of the present invention. A CTRL_FPGA unit 1200, described
above, is coupled to bus 1210 via lines 1209 and 1236. In one embodiment, the CTRL_FPGA
unit 1200 is a programmable logic device (PLD) in the form of an FPGA chip, such as an Altera
10K50 chip. Bus 1210 allows the CTRL_FPGA unit 1200 to be coupled to other Simulation
array boards (if any) and other chips (e.g., PCI controller, EEPROM, clock buffer). FIG. 56
shows other major functional blocks in the form of logic devices and memory devices. In one
embodiment, the logic device is a programmable logic device (PLD) in the form of an FPGA
chip, such as an Altera 10K130V or 10K250V chip. The 10K130V and 10K250V are pin
compatible and each is a 599-pin PGA package. Thus, instead of the embodiment shown above
with the eight Altera FLEX 10K100 chips in the array, this embodiment uses only four chips of
Altera's FLEX 10K130. One embodiment of the present invention describes the board containing
these four logic devices and their interconnections.

Because the user's design is modeled and configured in any number of these logic devices
in the array, inter-FPGA logic device communication is necessary to connect one part of the
user's circuit design to another part. Furthermore, initial configuration information and boundary
scan tests are also supported by the inter-FPGA interconnects. Finally, the necessary Simulation
system control signals must be accessible between the Simulation system and the FPGA logic
devices.

FIG. 36 shows the hardware architecture of an FPGA logic device used in the present
invention. The FPGA logic device 1500 includes 102 top I/O pins, 102 bottom I/O pins, 111 left
I/O pins, and 110 right I/O pins. Thus, the total number of interconnect pins is 425.
Furthermore, an additional 45 I/O pins are dedicated for GCLK, FPGA bus FD[31:0] (for the
high bank, FD[63:32] is dedicated), F_RD, F_WR, DATA_XSFR, SHIFTIN, SHIFTOUT,
SPACE[2:0], ~EVAL, EVAL_REQ_N, DEVICE_OE (signal from CTRL_FPGA unit to turn on
the output pins of FPGA logic devices), and DEV_CLRN (signal from CTRL_FPGA unit to clear
all the internal flip-flops before starting the simulation). Thus, any data and control signals that
cross between any two FPGA logic devices are carried by these interconnections. The remaining
pins are dedicated for power and ground.
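
The pin budget above can be tallied as follows. This is an illustrative check only. The bus
widths follow the bracketed ranges in the text, and every other listed signal is assumed to
occupy a single pin, an assumption that happens to reproduce the stated total of 45. The
power/ground figure additionally assumes the 599-pin package mentioned for the 10K130V and
10K250V devices.

```python
INTERCONNECT_PINS = {"top": 102, "bottom": 102, "left": 111, "right": 110}

# Dedicated I/O pins; single-bit widths are assumptions of this sketch.
DEDICATED_PINS = {
    "GCLK": 1,
    "FD[31:0]": 32,      # FD[63:32] for a high bank device
    "F_RD": 1,
    "F_WR": 1,
    "DATA_XSFR": 1,
    "SHIFTIN": 1,
    "SHIFTOUT": 1,
    "SPACE[2:0]": 3,
    "~EVAL": 1,
    "EVAL_REQ_N": 1,
    "DEVICE_OE": 1,
    "DEV_CLRN": 1,
}

PACKAGE_PINS = 599  # assumed package size

interconnect_total = sum(INTERCONNECT_PINS.values())
dedicated_total = sum(DEDICATED_PINS.values())
power_ground_pins = PACKAGE_PINS - interconnect_total - dedicated_total

print(interconnect_total, dedicated_total, power_ground_pins)
```

Under these assumptions, 129 pins would remain for power and ground.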

FIG. 37 shows the FPGA interconnect pin-outs for a single FPGA chip in accordance with
one embodiment of the present invention. Each chip 1510 may have up to eight sets of
interconnections, where each set comprises a particular number of pins. Some chips may have
fewer than eight sets of interconnections depending on their respective positions on the board. In
the preferred embodiment, all chips have seven sets of interconnections, although the specific sets
of interconnections used may vary from chip to chip depending on their respective location on the
board. The interconnections for each FPGA chip are oriented horizontally (East-West) and
vertically (North-South). The set of interconnections for the West direction is labeled as
W[73:0]. The set of interconnections for the East direction is labeled as E[73:0]. The set of
interconnections for the North direction is labeled as N[73:0]. The set of interconnections for the
South direction is labeled as S[73:0]. These complete sets of interconnections are for the
connections to adjacent chips; that is, these interconnections do not "hop" over any chip. For
example, in FIG. 39, chip 1570 has interconnection 1540 for N[73:0], interconnection 1542 for
W[73:0], interconnection 1543 for E[73:0], and interconnection 1545 for S[73:0]. Note that this
FPGA chip 1570, which is also the FPGA2 chip, has all four sets of adjacent interconnections -
N[73:0], S[73:0], W[73:0], and E[73:0]. The West interconnections of FPGA0 connect to the
east interconnections of FPGA3 through wire 1539 via a torus-style interconnection. Thus, wire
1539 allows the chips 1569 (FPGA0) and 1572 (FPGA3) to be directly coupled to each other in a
manner akin to wrapping the west and east ends of the board around to meet each other.

Returning to FIG. 37, four sets of "hopping" interconnections are provided. Two sets of
interconnections are for the non-adjacent interconnections running vertically - NH[27:0] and
SH[27:0]. For example, FPGA2 chip 1570 in FIG. 39 shows NH interconnect 1541 and SH
interconnect 1546. Returning to FIG. 37, the other two sets of interconnections are for the
non-adjacent interconnections running horizontally - XH[36:0] and XH[72:37]. For example,
FPGA2 chip 1570 in FIG. 39 shows XH interconnect 1544.

Returning to FIG. 37, the vertical hopping interconnections NH[27:0] and SH[27:0] have
28 pins each. The horizontal interconnections have 73 pins, XH[36:0] and XH[72:37]. The
horizontal interconnection pins, XH[36:0] and XH[72:37], can be used on the west side (e.g., for
FPGA3 chip 1576, interconnect 1605 in FIG. 39) and/or the east side (e.g., for FPGA0 chip
1573, interconnect 1602 in FIG. 39). This configuration allows each chip to be manufactured
identically. Thus, each chip is capable of being interconnected in one hop to a non-adjacent chip
located above, below, left, and right.
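
As a consistency check, the widths of the eight interconnection sets can be summed from their
bit ranges. The sketch below is illustrative and the names are hypothetical. It shows that, if
all eight sets were populated, the total would equal the 425 interconnect pins stated for the
device of FIG. 36.

```python
def bus_width(msb: int, lsb: int) -> int:
    """Number of pins in a bus declared as NAME[msb:lsb]."""
    return msb - lsb + 1


INTERCONNECT_SETS = {
    "N[73:0]": (73, 0),
    "S[73:0]": (73, 0),
    "W[73:0]": (73, 0),
    "E[73:0]": (73, 0),
    "NH[27:0]": (27, 0),
    "SH[27:0]": (27, 0),
    "XH[36:0]": (36, 0),
    "XH[72:37]": (72, 37),
}

widths = {name: bus_width(msb, lsb)
          for name, (msb, lsb) in INTERCONNECT_SETS.items()}
horizontal_hop_pins = widths["XH[36:0]"] + widths["XH[72:37]"]
total_pins = sum(widths.values())

print(horizontal_hop_pins, total_pins)
```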

FIG. 39 shows a direct-neighbor and one-hop neighbor FPGA array layout of the six
boards on a single motherboard in accordance with one embodiment of the present invention.
This figure will be used to illustrate two possible configurations - a six-board system and a
dual-board system. Position indicator 1550 shows that the "Y" direction is north-south and the
"X" direction is east-west. In the X direction, the array is a torus. In the Y direction, the array is
a mesh. In FIG. 39, only the boards, FPGA logic devices, interconnects, and connectors at a high
level are shown. The motherboard and other supporting components (e.g., SRAM memory
devices) and wire lines (e.g., FPGA bus) are not shown.
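
The torus-in-X, mesh-in-Y arrangement can be expressed as a small neighbor function. The
sketch below is a hypothetical model, not taken from the figures: it assumes four chips per
board along X and six boards along Y, indexes chips by (x, y) coordinates, and arbitrarily
treats decreasing y as north. With hop=1 it returns direct neighbors. With hop=2 it returns
one-hop neighbors that skip over one chip.

```python
from typing import Dict, Tuple


def neighbors(x: int, y: int, width: int = 4, height: int = 6,
              hop: int = 1) -> Dict[str, Tuple[int, int]]:
    """Neighbors of chip (x, y): X wraps around (torus), Y does not (mesh)."""
    result = {
        "W": ((x - hop) % width, y),   # wraps around the west edge
        "E": ((x + hop) % width, y),   # wraps around the east edge
    }
    if y - hop >= 0:                   # no wrap-around at the north edge
        result["N"] = (x, y - hop)
    if y + hop < height:               # no wrap-around at the south edge
        result["S"] = (x, y + hop)
    return result


print(neighbors(0, 0))
print(neighbors(1, 2, hop=2))
```

In a four-chip torus row, the east and west one-hop neighbors of a chip are the same chip, which
is consistent with a single horizontal hopping set being usable on either side.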

Note that FIG. 39 provides an array view of the boards and their components,
interconnects, and connectors. The actual physical configuration and installation involves placing
these boards on their respective edges component-side to solder-side. Approximately half of the
boards are directly connected to the motherboard while the other half of the boards are connected
to their respective neighbor boards.

In the six-board embodiment of the present invention, six boards 1551 (board1), 1552
(board2), 1553 (board3), 1554 (board4), 1555 (board5), and 1556 (board6) are provided on the
motherboard (not shown) as part of the reconfigurable hardware unit 20 in FIG. 1. Each board
contains an almost identical set of components and connectors. Thus, for illustrative purposes,
the sixth board 1556 contains FPGA logic devices 1565 to 1568, and connectors 1557 to 1560
and 1581; the fifth board 1555 contains FPGA logic devices 1569 to 1572 and connectors 1582
and 1583; and the fourth board 1554 contains FPGA logic devices 1573 to 1576, and connectors
1584 and 1585.

In this six-board configuration, board1 1551 and board6 1556 are provided as "bookend"
boards that contain the Y-mesh terminations such as R-pack terminations 1557 to 1560 on board6
1556 and terminations 1591 to 1594 on board1 1551. Intermediately placed boards (i.e., boards
1552 (board2), 1553 (board3), 1554 (board4), and 1555 (board5)) are also provided to complete
the array.

As explained above, the interconnects are arranged according to adjacent direct-neighbor
interconnects (i.e., N[73:0], S[73:0], W[73:0], E[73:0]), and one-hop neighbor interconnects
(i.e., NH[27:0], SH[27:0], XH[36:0], XH[72:37]), excluding the local bus connections, within a
single board and across different boards. The interconnects alone can couple logic devices and
other components within a single board. However, inter-board connectors 1581 to 1590 allow
communication among the FPGA logic devices across different boards (i.e., board1 to board6).
The FPGA bus is part of the inter-board connectors 1581 to 1590. These connectors 1581 to
1590 are 600-pin connectors carrying 520 signals and 80 power/ground connections between two
adjacent array boards.

In FIG. 39, the various boards are arranged in a non-symmetrical manner with respect to
the inter-board connectors 1581 to 1590. For example, between boards 1551 and 1552,
inter-board connectors 1589 and 1590 are provided. Interconnect 1515 connects FPGA logic
devices 1511 and 1577 together and, with respect to connectors 1589 and 1590, this connection is
symmetrical. However, interconnect 1603 is not symmetrical; it connects an FPGA logic device
in the third board 1553 to the FPGA logic device 1577 in board 1551. With respect to
connectors 1589 and 1590, such an interconnect is not symmetrical. Similarly, interconnect 1600
is not symmetrical with respect to connectors 1589 and 1590 because it connects FPGA logic
device 1577 to the termination 1591, which connects to FPGA logic device 1577 via interconnect
1601. Other similar interconnects exist which further show the non-symmetry.

As a result of this non-symmetry, the interconnects are routed through the inter-board
connectors in two different ways - one for symmetric interconnects like interconnect 1515 and
another for non-symmetric interconnects like interconnects 1603 and 1600. The interconnection
routing scheme is shown in FIGS. 40(A) and 40(B).

In FIG. 39, an example of a direct-neighbor connection within a single board is
interconnect 1543 which couples logic device 1570 to logic device 1571 along the east-west
direction in board 1555. Another example of a direct-neighbor connection within a single board
is interconnect 1607 which couples logic device 1573 to logic device 1576 in board 1554. An
example of a direct-neighbor connection between two different boards is interconnect 1545 which
couples logic device 1570 in board 1555 to logic device 1574 in board 1554 via connectors 1583
and 1584 along the north-south direction. Here, two inter-board connectors 1583 and 1584 are
used to transport signals across.

An example of a one-hop interconnect within a single board is interconnect 1544 which
couples logic device 1570 to logic device 1572 in board 1555 along the east-west direction. An
example of a one-hop interconnect between two different boards is interconnect 1599 which
couples logic device 1565 in board 1556 to logic device 1573 in board 1554 via connectors 1581
to 1584. Here, four inter-board connectors 1581 to 1584 are used to transport signals across.

Some boards, especially those positioned at the north-south ends on the motherboard, also
contain 10-ohm R-packs to terminate some connections. Thus, the sixth board 1556 includes the
10-ohm R-pack connectors 1557 to 1560, and the first board 1551 includes the 10-ohm R-pack
connectors 1591 to 1594. The sixth board 1556 contains R-pack connector 1557 for
interconnects 1970 and 1971, R-pack connector 1558 for interconnects 1972 and 1541, R-pack
connector 1559 for interconnects 1973 and 1974, and R-pack connector 1560 for interconnects
1975 and 1976. Moreover, interconnects 1561 to 1564 are not connected to anything. These
north-south interconnections, unlike the east-west torus-type interconnections, are arranged in
mesh-type fashion.

These mesh terminations increase the number of north-south direct interconnections;
otherwise, the interconnections at the north and south edges of the FPGA mesh would all be
wasted. For example, FPGA logic devices 1511 and 1577 already have one set of direct
interconnection 1515. Additional interconnections are also provided for these two FPGA logic
devices via R-pack 1591 and interconnects 1600 and 1601; that is, R-pack 1591 connects
interconnects 1600 and 1601 together. This increases the number of direct connections between
FPGA logic devices 1511 and 1577.

Inter-board connections are also provided. Logic devices 1577, 1578, 1579, and 1580 on
board 1551 are coupled to logic devices 1511, 1512, 1513, and 1514 on board 1552 via
interconnects 1515, 1516, 1517, and 1518 and inter-board connectors 1589 and 1590. Thus,
interconnect 1515 couples the logic device 1511 on board 1552 to logic device 1577 on board
1551 via connectors 1589 and 1590; interconnect 1516 couples the logic device 1512 on board
1552 to logic device 1578 on board 1551 via connectors 1589 and 1590; interconnect 1517
couples the logic device 1513 on board 1552 to logic device 1579 on board 1551 via connectors
1589 and 1590; and interconnect 1518 couples the logic device 1514 on board 1552 to logic
device 1580 on board 1551 via connectors 1589 and 1590.

Some interconnects such as interconnects 1595, 1596, 1597, and 1598 are not coupled to
anything because they are not used. However, as mentioned above with respect to logic devices
1511 and 1577, R-pack 1591 connects interconnects 1600 and 1601 to increase the number of
north-south interconnects.

A dual-board embodiment of the present invention is illustrated in FIG. 44. In the
dual-board embodiment of the present invention, only two boards are necessary to model the
user's design in the Simulation system. Like the six-board configuration of FIG. 39, the
dual-board configuration of FIG. 44 uses the same two boards for "bookends" - board1 1551 and
board6 1556, which are provided on a motherboard as part of the reconfigurable hardware unit 20
in FIG. 1. In FIG. 44, one bookend board is board1 and the second bookend board is board6.
Board6 is used in FIG. 44 to show its similarity to board6 in FIG. 39; that is, the bookend boards
like board1 and board6 should have the requisite terminations for the north-south mesh
connections.

This dual-board configuration contains four FPGA logic devices 1577 (FPGA0), 1578
(FPGA1), 1579 (FPGA2), and 1580 (FPGA3) on board1 1551, and four FPGA logic devices
1565 (FPGA0), 1566 (FPGA1), 1567 (FPGA2), and 1568 (FPGA3) on board6 1556. These two
boards are connected by inter-board connectors 1581 and 1590.

These boards contain 10-ohm R-packs to terminate some connections. For the dual-board 
embodiment, both boards are the "bookend" boards. Board 1551 contains 10-ohm R-pack
connectors 1591, 1592, 1593, and 1594 as resistive terminations. The second board 1556 also 
contains the 10-ohm R-pack connectors 1557 to 1560. 

Board 1551 has connector 1590 and board 1556 has connector 1581 for inter-board 
communication. The interconnects that cross from one board to another, such as interconnects
1600, 1971, 1977, 1541, and 1540, go through these connectors 1590 and 1581; in other words,
the inter-board connectors 1590 and 1581 enable the interconnects 1600, 1971, 1977, 1541, and
1540 to make the connection between one component on one board and another component on
another board. The inter-board connectors 1590 and 1581 carry control data and control signals
on the FPGA buses.

For four-board configurations, board1 and board6 provide the bookend boards, while
board2 1552 and board3 1553 (see FIG. 39) are the intermediate boards. When coupled to the
motherboard in accordance with the present invention (to be discussed with respect to FIGS.
38(A) and 38(B)), board1 and board2 are paired and board3 and board6 are paired.

For six-board configurations, board1 and board6 provide the bookend boards as discussed
above, while board2 1552, board3 1553, board4 1554, and board5 1555 (see FIG. 39) are the
intermediate boards. When coupled to the motherboard in accordance with the present invention
(to be discussed with respect to FIGS. 38(A) and 38(B)), board1 and board2 are paired, board3
and board4 are paired, and board5 and board6 are paired.
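
The pairing rule described above for dual-, four-, and six-board configurations can be sketched in a few lines. This is only an illustration: the function name `board_pairs` is hypothetical, and the sketch assumes that board1 and board6 always serve as the bookends and that intermediate boards are numbered upward from board2.

```python
def board_pairs(num_boards: int) -> list[tuple[str, str]]:
    """Return the motherboard pairings for a 2-, 4-, or 6-board system."""
    if num_boards not in (2, 4, 6):
        raise ValueError("this sketch covers only 2, 4, or 6 boards")
    # board1 and board6 are the bookends; any intermediate boards
    # (board2, board3, ...) sit between them in order.
    intermediates = [f"board{i}" for i in range(2, num_boards)]
    order = ["board1", *intermediates, "board6"]
    # Boards are paired two at a time, starting from board1.
    return [(order[i], order[i + 1]) for i in range(0, num_boards, 2)]


if __name__ == "__main__":
    for n in (2, 4, 6):
        print(n, board_pairs(n))
```

For four boards the sketch pairs board1 with board2 and board3 with board6, which matches the pairing stated in the text.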

More boards can be provided as necessary. However, regardless of the number of boards
that will be added to the system, the bookend boards (such as board1 and board6 of FIG. 39) should
have the requisite terminations that complete the mesh array connections. In one embodiment,
the minimum configuration is the dual-board configuration of FIG. 44. More boards can be
added in two-board increments. If the initial configuration had board1 and board6, a future
modification to a four-board configuration involves moving board6 further out and pairing
board1 and board2 together, and then pairing board3 and board6 together, as mentioned above.

As described above, each logic device is coupled to its adjacent neighbor logic device and 
its non-adjacent neighbor logic device within one hop. Thus, in FIGS. 39 and 44, logic device 
1577 is coupled to adjacent neighbor logic device 1578 via interconnect 1547. Logic device 1577 
is also coupled to non-adjacent logic device 1579 via one-hop interconnect 1548. However, logic
device 1580 is considered to be adjacent to logic device 1577 due to the wrap-around torus 
configuration with interconnect 1549 providing the coupling. 
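
The neighbor relationships within one row of four logic devices can be illustrated with a small classifier. This is a sketch under stated assumptions: the function name `classify_link` and the index convention (0 for FPGA0 through 3 for FPGA3) are hypothetical, and the wrap-around is modeled as simple modular distance along the row.

```python
def classify_link(a: int, b: int, row_len: int = 4) -> str:
    """Classify the coupling between two logic devices in one row."""
    if a == b:
        raise ValueError("a device is not its own neighbor")
    if not (0 <= a < row_len and 0 <= b < row_len):
        raise ValueError("device index out of range")
    # Distance along the row, taking the wrap-around (torus) link into account.
    d = abs(a - b)
    d = min(d, row_len - d)
    if d == 1:
        return "adjacent"
    if d == 2:
        return "one-hop"
    return "unconnected"


if __name__ == "__main__":
    print(classify_link(0, 1))  # devices 1577 and 1578
    print(classify_link(0, 2))  # devices 1577 and 1579
    print(classify_link(0, 3))  # devices 1577 and 1580 (wrap-around)
```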

Various board layouts are possible with the present invention. Each board may hold any 
number of rows of FPGA chips, limited only by the physical dimensions of the system hardware.
Interconnects between adjacent boards extend the FPGA array uniformly in one dimension.
Thus, a single board with one row of four FPGA chips provides a 1x4 array. By adding a second
board with one row of four FPGA chips and the proper interconnects, the array has been
extended to 2x4. If the extension is due to the addition of more rows, the extension is vertical.
In order to achieve this expandability, the I/O signals of the FPGA array in each board are
grouped into two categories - Group C and Group S.
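
The array arithmetic in the paragraph above (1x4 for one board, 2x4 for two) can be written out as a one-line calculation. The helper name `array_shape` and its parameters are hypothetical and serve only to make the arithmetic explicit.

```python
def array_shape(num_boards: int, rows_per_board: int = 1,
                chips_per_row: int = 4) -> tuple[int, int]:
    """Return (rows, columns) of the combined FPGA array."""
    if num_boards < 1 or rows_per_board < 1 or chips_per_row < 1:
        raise ValueError("all counts must be positive")
    # Adding boards (or rows per board) extends the array in one dimension only.
    return (num_boards * rows_per_board, chips_per_row)


if __name__ == "__main__":
    print(array_shape(1))  # one board, one row of four chips
    print(array_shape(2))  # second board added
```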

Group C signals are connected to the next board by using connectors on the component 
side of the PCB. These connectors are at one edge of the FPGA array to facilitate short trace 
lengths and provide a lower number of signal layers for this PCB design. Group S signals are 
connected to the previous board by using connectors on the solder side of the PCB. These
connectors are at the other edge of the FPGA array to facilitate short trace lengths and provide a
lower number of signal layers for this PCB design. For example, referring now to FIG. 85,
board 3 includes a single row with exemplary FPGA chip FPGA0. The Group C component side
signals are represented by C1, C2, and C3 on one edge, while the Group S solder side signals are
represented by S4, S5, and S6 on the other edge.

As a general rule, two adjacent boards are interconnected by mating connectors of Group
C and Group S of these two boards at the same edge. In other words, these two boards are
interconnected to each other at the top edge or the bottom edge. To achieve high packaging
density, short trace lengths, and better performance, however, the interconnect must not pass
through the motherboard or other backplane. In contrast, the motherboard or backplane methods
require all the connectors to be placed at only one edge of the board, thus forcing all I/O signals
from the other edge of the FPGA array to be routed across the board. Today's FPGA chip has
over 500 I/O pins and the number of interconnect signals reaches thousands. It may not be
feasible to design a compact interconnect system by using off-the-shelf connectors. The array
layout design of the present invention of placing two-group connectors at both edges of the FPGA
board doubles the maximum possible number of interconnect signals per board. Furthermore, the
design of the present invention reduces the complexity of the PCB design.

For those FPGA arrays with direct and one-hop connections, odd and even boards utilize
different connections between the I/O signals and the connectors. FIGS. 85-88 show the various
inter-board connection schemes for those FPGA boards with single, dual, triple, and quadruple
rows. For simplicity, only one column is shown for each board layout. The mating connectors
at the interconnects are pairs of Group C and Group S connectors with the same pin position (X,
Y coordinates on the board), such as C1 and S1, C2 and S2, etc.

In the single row configuration, FIG. 85 shows eight boards and, as mentioned above, one
column. Because only one column is shown, only the first FPGA chip FPGA0 of each board is
shown. To illustrate the interconnect scheme, the first three boards will be examined. The north
edge of board 1 is aligned with the north edge of board 2 and board 3. However, the north edges
of board 1 and board 2 are interconnected, while the north edges of board 2 and 3 are not
interconnected. Also, the south edges of board 1, board 2, and board 3 are aligned. However,
only the south edges of boards 2 and 3 are interconnected. Between board 1 and board 2, direct
neighbor north connection C1, C2, and C3 in board 1 are coupled to north connection S1, S2,
and S3 of board 2, respectively. However, only the C1-S1 connection is direct. The connection
C2-S2 is one-hop (between board 1 and board 3 via connectors C5 and S5) and C3-S3 is another
one-hop (between board 2 and termination via connector S6). Similarly, between board 2 and
board 3, direct neighbor south connection C4, C5, and C6 in board 2 are coupled to south
connection S4, S5, and S6 of board 3, respectively. However, only the C4-S4 connection is
direct. The connection C5-S5 is one-hop (between board 1 and board 3 via connectors C2 and S2)
and C6-S6 is another one-hop (between board 2 and board 4 via connectors C3 and S3). Because
only one row is provided in each board, the one-hop appears to be skipping boards. However, as
more rows of chips are added, the one-hop concept refers to the skipping of a chip. Thus, even
in one board, the one-hop connection is between two chips that are not adjacent to each other;
that is, the connection has to skip over one chip between the two connecting chips.

In the dual row configuration, FIG. 86 shows four boards and, as mentioned above, one
column. Because only one column is shown, only the first two FPGA chips FPGA0 and FPGA1
of each board are shown. To illustrate the interconnect scheme, the first three boards will be
examined. The north edge of board 1 is aligned with the north edge of board 2 and board 3.
However, the north edges of board 1 and board 2 are interconnected, while the north edges of
board 2 and 3 are not interconnected. Also, the south edges of board 1, board 2, and board 3 are
aligned. However, only the south edges of boards 2 and 3 are interconnected. Between board 1
and board 2, direct neighbor north connection C1, C2, and C3 in board 1 are coupled to north
connection S1, S2, and S3 of board 2, respectively. However, only the C1-S1 connection is
direct. The connection C2-S2 is one-hop (between chip FPGA1 in board 1 and chip FPGA0 in
board 2 via connectors C5 and S5) and C3-S3 is another one-hop (between chip FPGA1 in board
2 and chip FPGA0 in board 1). Similarly, between board 2 and board 3, direct neighbor south
connection C4, C5, and C6 in board 2 are coupled to south connection S4, S5, and S6 of board
3, respectively. However, only the C4-S4 connection is direct. The connections C5-S5 and
C6-S6 are one-hop connections (one chip between the connecting chips is skipped).

Note that the inter-board interconnects are provided by the FPGA chips at the edges of
each board. Also, the interconnects at the north edges are coupled together, while the
interconnects at the south edges are coupled together.

A similar concept is utilized for the triple-row configuration shown in FIG. 87 and the
quadruple-row layout of FIG. 88. The interconnect scheme for the triple-row layout is
summarized in the table provided in FIG. 89. Generally, as odd-numbered boards are installed,
only connectors C1, C2, C3, S4, S5, and S6 are loaded. For even-numbered boards, only
connectors S1, S2, S3, C4, C5, and C6 are loaded. Some pin positions (e.g., 1 and 4) of both
component-side and solder-side are connected to the same direct-connect signals (N, S). For
example, C1 and S1 are connected to FPGA2 (N), while C4 and S4 are connected to FPGA0 (S).
Other pin positions (e.g., 2, 3, 5, 6) of component-side and solder-side are connected to
different one-hop I/O signals (SH, NH). For example, C2 connects to FPGA2 (NH) and S2
connects to FPGA1 (NH). In these cases, the inter-board connectors are surface-mount type
instead of through-hole type.
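
The odd/even loading rule and the resulting mating pairs can be sketched as follows. The names `loaded_connectors` and `mating_pairs` are hypothetical; the sketch assumes only what the text states, namely that a Group C connector mates with the Group S connector at the same pin position on the next board.

```python
# Connectors loaded on odd- and even-numbered boards, as listed in the text.
ODD_BOARD = ("C1", "C2", "C3", "S4", "S5", "S6")
EVEN_BOARD = ("S1", "S2", "S3", "C4", "C5", "C6")


def loaded_connectors(board_number: int) -> tuple[str, ...]:
    """Return the connectors loaded on the given board (1-based number)."""
    if board_number < 1:
        raise ValueError("board numbers start at 1")
    return ODD_BOARD if board_number % 2 == 1 else EVEN_BOARD


def mating_pairs(board_number: int) -> list[tuple[str, str]]:
    """Pairs (C on this board, S on the next board) that share a pin position."""
    this_board = loaded_connectors(board_number)
    next_board = loaded_connectors(board_number + 1)
    pairs = []
    for conn in this_board:
        if conn.startswith("C"):          # Group C connects to the next board
            mate = "S" + conn[1:]         # same pin position, solder side
            if mate in next_board:
                pairs.append((conn, mate))
    return pairs


if __name__ == "__main__":
    print(mating_pairs(1))  # north-edge pairs between boards 1 and 2
    print(mating_pairs(2))  # south-edge pairs between boards 2 and 3
```

Under this sketch, boards 1 and 2 mate at positions 1 to 3 and boards 2 and 3 mate at positions 4 to 6, consistent with the north-edge and south-edge coupling described for FIGS. 85 and 86.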

FIG. 42 shows a top view (component side) of the on-board components and connectors
for a single board. In one embodiment of the present invention, only one board is necessary to
model the user's design in the Simulation system. In other embodiments, multiple boards (i.e., at
least 2 boards) are necessary. Thus, for example, FIG. 39 shows six boards 1551 to 1556
coupled together through various 600-pin connectors 1581 to 1590. At the top and bottom ends,
board 1551 is terminated by one set of 10-ohm R-packs and board 1556 is terminated by another
set of 10-ohm R-packs.

Returning to FIG. 42, board 1820 contains four FPGA logic devices 1822 (FPGA0), 1823
(FPGA1), 1824 (FPGA2), and 1825 (FPGA3). Two SRAM memory devices 1828 and 1829 are
also provided. These SRAM memory devices 1828 and 1829 will be used to map the memory
blocks from the logic devices on this board; in other words, the memory Simulation aspect of the
present invention maps memory blocks from the logic devices on this board to the SRAM
memory devices on this board. Other boards will contain other logic devices and memory
devices to accomplish a similar mapping operation. In one embodiment, the memory mapping is
dependent on the boards; that is, memory mapping for board1 is limited to logic devices and
memory devices on board1 while disregarding other boards. In other embodiments, the memory
mapping is independent of the boards. Thus, a few large memory devices will be used to map
memory blocks from logic devices on one board to memory devices located on another board.

Light-emitting diodes (LEDs) 1821 are also provided to visually indicate some select 
activities. The LED display is as follows in Table A in accordance with one embodiment of the 
present invention: 

TABLE A: LED DISPLAY

LED    Color   State   Description
LED1   Green   On      +5V and +3.3V are normal.
               Off     +5V or +3.3V are abnormal.
LED2   Amber   Off     All on-board FPGA configuration is done.
               Blink   On-board FPGAs are not configured or configuration failed.
               On      FPGA configuration is in process.
LED3   Red     On      Data transfer is in process.
               Off     No data transfer.
               Blink   Diagnostic tests fail.
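
For readers who want to interpret the LED states programmatically, Table A maps directly onto a lookup table. This is a minimal sketch; the dictionary `LED_TABLE` and the function `describe_led` are hypothetical names, and the entries simply restate Table A.

```python
# (LED, state) -> description, restating Table A.
LED_TABLE = {
    ("LED1", "On"): "+5V and +3.3V are normal.",
    ("LED1", "Off"): "+5V or +3.3V are abnormal.",
    ("LED2", "Off"): "All on-board FPGA configuration is done.",
    ("LED2", "Blink"): "On-board FPGAs are not configured or configuration failed.",
    ("LED2", "On"): "FPGA configuration is in process.",
    ("LED3", "On"): "Data transfer is in process.",
    ("LED3", "Off"): "No data transfer.",
    ("LED3", "Blink"): "Diagnostic tests fail.",
}


def describe_led(led: str, state: str) -> str:
    """Return the Table A description for an LED and its observed state."""
    try:
        return LED_TABLE[(led, state)]
    except KeyError:
        raise ValueError(f"no Table A entry for {led} in state {state}") from None


if __name__ == "__main__":
    print(describe_led("LED2", "Blink"))
```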



Various other control chips such as the PLX PCI controller 1826 and CTRL_FPGA unit
1827 control inter-FPGA and PCI communications. One example of a PLX PCI controller 1826
that may be used in the system is PLX Technology's PCI 9080 or 9060. The PCI 9080 has the
appropriate local bus interface, control registers, FIFOs, and PCI interface to the PCI bus. The
data book PLX Technology, PCI 9080 Data Sheet (ver. 0.93, Feb. 28, 1997) is incorporated
herein by reference. One example of the CTRL_FPGA unit 1827 is a programmable logic device
(PLD) in the form of an FPGA chip, such as an Altera 10K50 chip. In multiple board
configurations, only the first board coupled to the PCI bus contains the PCI controller.

Connector 1830 connects the board 1820 to the motherboard (not shown), and hence, the
PCI bus, power, and ground. For some boards, the connector 1830 is not used for direct
connection to the motherboard. Thus, in a dual-board configuration, only the first board is
directly coupled to the motherboard. In a six-board configuration, only boards 1, 3, and 5 are
directly connected to the motherboard while the remaining boards 2, 4, and 6 rely on their
neighbor boards for motherboard connectivity. Inter-board connectors J1 to J28 are also
provided. As the name implies, these connectors J1 to J28 allow connections across different
boards.
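
The rule for which boards plug directly into the motherboard reduces to a parity check on board position. The function names below are hypothetical, and the sketch assumes positions are counted from 1 in installation order (so in a dual-board system the second bookend is position 2).

```python
def directly_on_motherboard(position: int) -> bool:
    """True if the board at this 1-based position uses its motherboard connector."""
    if position < 1:
        raise ValueError("positions start at 1")
    # Odd positions connect directly; even positions rely on a neighbor board.
    return position % 2 == 1


def direct_boards(num_boards: int) -> list[int]:
    """List the positions that connect directly to the motherboard."""
    return [p for p in range(1, num_boards + 1) if directly_on_motherboard(p)]


if __name__ == "__main__":
    print(direct_boards(2))  # dual-board configuration
    print(direct_boards(6))  # six-board configuration
```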

Connector J1 is for external power and ground connections. The following Table B
shows the pins and corresponding description for the external power connector J1 in accordance
with one embodiment of the present invention:



TABLE B: EXTERNAL POWER - J1

Pin Number   Description
1            VCC5V
2            GND
3            GND
4            VCC3V



Connector J2 is for the parallel port connection. Connectors J1 and J2 are used for
stand-alone single-board boundary scan test during production. The following Table C shows the
pins and corresponding description for the parallel JTAG port connector J2 in accordance with one
embodiment of the present invention:



TABLE C: PARALLEL JTAG PORT - J2

Board                                              Parallel Port
Pin Number                       Signal     I/O    Pin Number   Signal
3                                PARA_TCK   I      2            D0
5                                PARA_TMS   I      3            D1
7                                PARA_TDI   I      4            D2
9                                PARA_NR    I      5            D3
19                               PARA_TDO   O      10           NACK
10, 12, 14, 16, 18, 20, 22, 24   GND               18-25        GND



Connectors J3 and J4 are for the local bus connections across boards. Connectors J5 to
J16 are one set of FPGA interconnect connections. Connectors J17 to J28 are a second set of
FPGA interconnect connections. When placed component-side to solder-side, these connectors
provide effective connections between one component in one board with another component in
another board. The following Tables D and E provide a complete list and description of the
connectors J1 to J28 in accordance with one embodiment of the present invention:

TABLE D: CONNECTORS J1-J28

Connector  Description                              Type
J1         +5V/+3V external power                   4-pin power RA header, comp side
J2         Parallel Port                            0.1" pitch, 2-row thru-hole RA header, comp side
J3         Local Bus                                0.05" pitch, 2x30 thru-hole header, SAMTEC, comp side
J4         Local Bus                                0.05" pitch, 2x30 thru-hole receptacle, SAMTEC, solder side
J5         Row A: NH[0], VCC3V, GND                 0.05" pitch, 2x30 SMD header, SAMTEC, comp side
           Row B: J17 Row B, VCC3V, GND
J6         Row A: J5 Row B, VCC3V, GND              0.05" pitch, 2x30 SMD receptacle, SAMTEC, solder side
           Row B: J5 Row A, VCC3V, GND
J7         Row A: N[0], 4x VCC3V, 4x GND, N[2]      0.05" pitch, 2x45 thru-hole header, SAMTEC, comp/solder side
           Row B: N[0], 4x VCC3V, 4x GND, N[2]
J8         Row A: N[0], 4x VCC3V, 4x GND, N[2]      0.05" pitch, 2x45 thru-hole receptacle, SAMTEC, comp/solder side
           Row B: N[0], 4x VCC3V, 4x GND, N[2]
J9         Row A: NH[2], LASTL, GND                 0.05" pitch, 2x30 SMD header, SAMTEC, comp side
           Row B: J21 Row B, GND
J10        Row A: J9 Row B, FIRSTL, GND             0.05" pitch, 2x30 SMD receptacle, SAMTEC, solder side
           Row B: J9 Row A, GND
J11        Row A: NH[1], VCC3V, GND                 0.05" pitch, 2x30 SMD header, SAMTEC, comp side
           Row B: J23 Row B, VCC3V, GND
J12        Row A: J11 Row B, VCC3V, GND             0.05" pitch, 2x30 SMD receptacle, SAMTEC, solder side
           Row B: J11 Row A, VCC3V, GND
J13        Row A: N[1], 4x VCC3V, 4x GND, N[3]      0.05" pitch, 2x45 thru-hole header, SAMTEC, comp/solder side
           Row B: N[1], 4x VCC3V, 4x GND, N[3]
J14        Row A: N[1], 4x VCC3V, 4x GND, N[3]      0.05" pitch, 2x45 thru-hole receptacle, SAMTEC, comp/solder side
           Row B: N[1], 4x VCC3V, 4x GND, N[3]
J15        Row A: NH[3], LASTH, GND                 0.05" pitch, 2x30 SMD header, SAMTEC, comp side
           Row B: J27 Row B, GND
J16        Row A: J15 Row B, FIRSTH, GND            0.05" pitch, 2x30 SMD receptacle, SAMTEC, solder side
           Row B: J15 Row A, GND
J17        Row A: SH[0], VCC3V, GND                 0.05" pitch, 2x30 SMD header, SAMTEC, comp side
           Row B: J5 Row B, VCC3V, GND
J18        Row A: J17 Row B, VCC3V, GND             0.05" pitch, 2x30 SMD receptacle, SAMTEC, solder side
           Row B: J17 Row A, VCC3V, GND
J19        Row A: S[0], 4x VCC3V, 4x GND, S[2]      0.05" pitch, 2x45 thru-hole header, SAMTEC, comp/solder side
J20        Row A: S[0], 4x VCC3V, 4x GND, S[2]      0.05" pitch, 2x45 thru-hole receptacle, SAMTEC, comp/solder side
           Row B: S[0], 4x VCC3V, 4x GND, S[2]
J21        Row A: SH[2], LASTL, GND                 0.05" pitch, 2x30 SMD header, SAMTEC, comp side
           Row B: J9 Row B, GND
J22        Row A: J21 Row B, FIRSTL, GND            0.05" pitch, 2x30 SMD receptacle, SAMTEC, solder side
           Row B: J21 Row A, GND
J23        Row A: SH[1], VCC3V, GND                 0.05" pitch, 2x30 SMD header, SAMTEC, comp side
           Row B: J11 Row B, VCC3V, GND
J24        Row A: J23 Row B, VCC3V, GND             0.05" pitch, 2x30 SMD receptacle, SAMTEC, solder side
           Row B: J23 Row A, VCC3V, GND
J25        Row A: S[1], 4x VCC3V, 4x GND, S[3]      0.05" pitch, 2x45 thru-hole header, SAMTEC, comp/solder side
J26        Row A: S[1], 4x VCC3V, 4x GND, S[3]      0.05" pitch, 2x45 thru-hole receptacle, SAMTEC, comp/solder side
           Row B: S[1], 4x VCC3V, 4x GND, S[3]
J27        Row A: SH[3], LASTH, GND                 0.05" pitch, 2x30 SMD header, SAMTEC, comp side
           Row B: J15 Row B, GND
J28        Row A: J27 Row B, FIRSTH, GND            0.05" pitch, 2x30 SMD receptacle, SAMTEC, solder side
           Row B: J27 Row A, GND



Shaded connectors are through-hole type. Note that in Table D, the number in the brackets [ ] 
represents the FPGA logic device number 0 to 3. Thus, S[0] indicates the south interconnection 
(i.e., S[73:0] in FIG. 37) and its 74 bits of FPGA0.
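
The bracket notation used in Table D can be decoded mechanically. The sketch below is illustrative only: the function name `parse_table_d_entry` is hypothetical, and the direction labels assume that N and S denote the direct north and south interconnects and that NH and SH denote the corresponding one-hop interconnects.

```python
import re

# Signal prefix -> meaning (assumed from the N/S and NH/SH usage in the text).
DIRECTIONS = {
    "N": "north",
    "S": "south",
    "NH": "north one-hop",
    "SH": "south one-hop",
}


def parse_table_d_entry(entry: str) -> tuple[str, int]:
    """Split an entry such as 'S[0]' into (direction, FPGA device number)."""
    match = re.fullmatch(r"(NH|SH|N|S)\[([0-3])\]", entry.strip())
    if match is None:
        raise ValueError(f"not a Table D interconnect entry: {entry!r}")
    return DIRECTIONS[match.group(1)], int(match.group(2))


if __name__ == "__main__":
    print(parse_table_d_entry("S[0]"))   # south interconnect of FPGA0
    print(parse_table_d_entry("NH[2]"))  # north one-hop interconnect of FPGA2
```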



TABLE E: LOCAL BUS CONNECTORS - J3, J4

Pin Number   Signal                          I/O
             GND                             PWR
             LRESET_N                        I/O
             J3_CLK for J3, J4_CLK for J4
B2           VCC5V                           PWR
             GND                             PWR
             LD0                             I/O
             LD1                             I/O
             LD2                             I/O
             LD3                             I/O
             LD4                             I/O
             LD5                             I/O
B6           LD6                             I/O
             LD7                             I/O
             LD8                             I/O
             LD9                             I/O
             LD10                            I/O
             LD11                            I/O
             GND                             PWR
             VCC?V                           PWR
             LD12                            I/O
             LD13                            I/O
             LD14                            I/O
             LD15                            I/O
             LD16                            I/O
             LD17                            I/O
             LD18                            I/O
             LD19                            I/O
             LD20                            I/O
             LD21                            I/O
             LD22                            I/O
             LD23                            I/O
             LD24                            I/O
             LD25                            I/O
             LD26                            I/O
             LD27                            I/O
             LD28                            I/O
             LD29                            I/O
             LD30                            I/O
             LD31                            I/O
             VCC?V                           PWR
             LHOLD                           OT
             ADS_N                           I/O
             GND                             PWR
             DEN_N                           OT
             DTR_N                           O
             LA31                            O
             LA30                            O
             LA29                            O
             LA28                            O
             LA10                            O
             LA7                             O
             LA6                             O
             LA5                             O
             LA4                             O
             LA3                             O
A29          LA2                             O
             DONE                            OD
             VCC5V                           PWR
B30          VCC5V                           PWR

I/O direction is for Board 1 



FIG. 43 shows a legend of the connectors J1 to J28 in FIGS. 41(A) to 41(F) and 42. In
general, the clear filled blocks indicate surface mount, whereas the gray filled blocks represent
the through hole types. Also, the solid outline block represents the connectors located on the
component side. The dotted outline block represents the connectors located on the solder side.
Thus, the block 1840 with the clear fill and the solid outline represents a 2x30 header, surface
mount and located on the component side. Block 1841 with the clear fill and the dotted outline
represents a 2x30 receptacle, surface mount and located on the solder side of the board. Block
1842 with the gray fill and solid outline represents a 2x30 or 2x45 header, through hole and
located on the component side. Block 1843 with the gray fill and the dotted outline represents a
2x45 or 2x30 receptacle, through hole and located on the solder side. In one embodiment, the
Simulation system uses Samtec's SFM and TFM series of 2x30 or 2x45 micro strip connectors
for both surface mount and through hole types. Block 1844 with the cross-hatched fill and the
solid outline is an R-pack, surface mount and located on the component side of the board. Block
1845 with the cross-hatched fill and the dotted outline is an R-pack, surface mount and located on
the solder side. The Samtec specification from Samtec's catalog on their website is incorporated
by reference herein. Returning to FIG. 42, connectors J3 to J28 are the type as indicated in the
legend of FIG. 43.
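
The legend of FIG. 43 amounts to a two-key lookup on fill and outline. The following sketch restates it in code; the function name `decode_block` and the string labels are hypothetical, chosen only to mirror the wording of the legend.

```python
# Fill pattern -> mounting style, and outline -> board side, per the legend.
MOUNT_BY_FILL = {
    "clear": "surface mount",
    "gray": "through hole",
    "cross-hatched": "surface mount",
}
SIDE_BY_OUTLINE = {"solid": "component side", "dotted": "solder side"}


def decode_block(fill: str, outline: str) -> tuple[str, str, str]:
    """Return (part, mounting, side) for a legend block."""
    if fill not in MOUNT_BY_FILL or outline not in SIDE_BY_OUTLINE:
        raise ValueError(f"unknown legend block: {fill!r}, {outline!r}")
    if fill == "cross-hatched":
        part = "R-pack"
    else:
        # Headers sit on the component side, receptacles on the solder side.
        part = "header" if outline == "solid" else "receptacle"
    return part, MOUNT_BY_FILL[fill], SIDE_BY_OUTLINE[outline]


if __name__ == "__main__":
    print(decode_block("clear", "solid"))          # block 1840
    print(decode_block("gray", "dotted"))          # block 1843
    print(decode_block("cross-hatched", "solid"))  # block 1844
```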

FIGS. 41(A) to 41(F) show top views of each board and their respective connectors. FIG.
41(A) shows the connectors for board6. Thus, board 1660 contains connectors 1661 to 1681
along with motherboard connector 1682. FIG. 41(B) shows the connectors for board5. Thus,
board 1690 contains connectors 1691 to 1708 along with motherboard connector 1709. FIG.
41(C) shows the connectors for board4. Thus, board 1715 contains connectors 1716 to 1733
along with motherboard connector 1734. FIG. 41(D) shows the connectors for board3. Thus,
board 1740 contains connectors 1741 to 1758 along with motherboard connector 1759. FIG.
41(E) shows the connectors for board2. Thus, board 1765 contains connectors 1766 to 1783
along with motherboard connector 1784. FIG. 41(F) shows the connectors for board1. Thus,
board 1790 contains connectors 1791 to 1812 along with motherboard connector 1813. As
indicated on the legend on FIG. 43, these connectors for the six boards are various combinations
of (1) surface mount or through hole, (2) component side or solder side, and (3) header or
receptacle or R-pack.

In one embodiment, these connectors are used for inter-board communications. Related
buses and signals are grouped together and supported by these inter-board connectors for routing
signals between any two boards. Also, only half of the boards are directly coupled to the
motherboard. In FIG. 41(A), board6 1660 contains connectors 1661 to 1668 designated for one
set of the FPGA interconnects, connectors 1669 to 1674, 1676, and 1679 designated for another
set of FPGA interconnects, and connector 1681 designated for the local bus. Because board6
1660 is positioned as one of the boards at the end of the motherboard (along with board1 1790 in
FIG. 41(F) at the other end), connectors 1675, 1677, 1678, and 1680 are designated for the
10-ohm R-pack connections for certain north-south interconnects. Also, the motherboard connector
1682 is not used for board6 1660, as shown in FIG. 38(B) where the sixth board 1535 is coupled
to the fifth board 1534 but not directly coupled to the motherboard 1520.

In FIG. 41(B), board5 1690 contains connectors 1691 to 1698 designated for one set of
the FPGA interconnects, connectors 1699 to 1706 designated for another set of FPGA
interconnects, and connectors 1707 and 1708 designated for the local bus. Connector 1709 is
used to couple board5 1690 to the motherboard.

In FIG. 41(C), board4 1715 contains connectors 1716 to 1723 designated for one set of
the FPGA interconnects, connectors 1724 to 1731 designated for another set of FPGA
interconnects, and connectors 1732 and 1733 designated for the local bus. Connector 1734 is not
used to couple board4 1715 directly to the motherboard. This configuration is also shown in
FIG. 38(B) where the fourth board 1533 is coupled to the third board 1532 and the fifth board
1534 but not directly coupled to the motherboard 1520.

In FIG. 41(D), board3 1740 contains connectors 1741 to 1748 designated for one set of
the FPGA interconnects, connectors 1749 to 1756 designated for another set of FPGA
interconnects, and connectors 1757 and 1758 designated for the local bus. Connector 1759 is
used to couple board3 1740 to the motherboard.

In FIG. 41(E), board2 1765 contains connectors 1766 to 1773 designated for one set of
the FPGA interconnects, connectors 1774 to 1781 designated for another set of FPGA
interconnects, and connectors 1782 and 1783 designated for the local bus. Connector 1784 is not
used to couple board2 1765 directly to the motherboard. This configuration is also shown in
FIG. 38(B) where the second board 1525 is coupled to the third board 1532 and the first board
1526 but not directly coupled to the motherboard 1520.

In FIG. 41(F), board1 1790 contains connectors 1791 to 1798 designated for one set of
the FPGA interconnects, connectors 1799 to 1804, 1806, and 1809 designated for another set of
FPGA interconnects, and connectors 1811 and 1812 designated for the local bus. Connector
1813 is used to couple board1 1790 to the motherboard. Because board1 1790 is positioned as
one of the boards at the end of the motherboard (along with board6 1660 in FIG. 41(A) at the
other end), connectors 1805, 1807, 1808, and 1810 are designated for the 10-ohm R-pack
connections for certain north-south interconnects.

In one embodiment of the present invention, multiple boards are coupled to the
motherboard and to each other in a unique manner. Multiple boards are coupled together
component-side to solder-side. One of the boards, say the first board, is coupled to the
motherboard and hence, the PCI bus, via a motherboard connector. Also, the FPGA interconnect
bus on the first board is coupled to the FPGA interconnect bus of the other board, say the second
board, via a pair of FPGA interconnect connectors. The FPGA interconnect connector on the
first board is on the component side and the FPGA interconnect connector on the second board is
on the solder side. The component-side and solder-side connectors on the first board and second
board, respectively, allow the FPGA interconnect buses to be coupled together.

Similarly, the local buses on the two boards are coupled together via local bus connectors. 
The local bus connector on the first board is on the component side and the local bus connector 
on the second board is on the solder side. Thus, the component-side and solder-side connectors 
on the first board and second board, respectively, allow the local buses to be coupled together.

More boards can be added. A third board can be added with its solder-side to the
component-side of the second board. Similar FPGA interconnects and local bus inter-board
connections are also made. The third board is also coupled to the motherboard via another
connector but this connector merely provides power and ground to the third board, to be
discussed further below.

The component-side to solder-side connectors in the dual board configuration will be
discussed with reference to FIG. 38(A). This figure shows side views of the FPGA board
connection on the motherboard in accordance with one embodiment of the present invention.
FIG. 38(A) shows the dual-board configuration where, as the name implies, only two boards are
utilized. These two boards 1525 (board2) and 1526 (board1) in FIG. 38(A) coincide with the two
boards 1552 and 1551 in FIG. 39. The component sides of the boards 1525 and 1526 are
represented by reference numeral 1989. The solder sides of the two boards 1525 and 1526 are
represented by reference numeral 1988. As shown in FIG. 38(A), these two boards 1525 and
1526 are coupled to the motherboard 1520 via motherboard connector 1523. Other motherboard
connectors 1521, 1522, and 1524 can also be provided for expansion purposes. Signals between
the PCI bus and the boards 1525 and 1526 are routed via the motherboard connector 1523. PCI
signals are routed between the dual-board structure and the PCI bus via the first board 1526 first.
Thus, signals from the PCI bus encounter the first board 1526 first before they travel to the
second board 1525. Analogously, signals to the PCI bus from the dual-board structure are sent
from the first board 1526. Power is also applied to the boards 1525 and 1526 via motherboard
connector 1523 from a power supply (not shown).

As shown in FIG. 38(A), board 1526 contains several components and connectors. One 
such component is an FPGA logic device 1530. Connectors 1528A and 1531A are also provided. 
Similarly, board 1525 contains several components and connectors. One such component is an 
FPGA logic device 1529. Connectors 1528B and 1531B are also provided.

In one embodiment, connectors 1528A and 1528B are the inter-board connectors for the
FPGA bus such as 1590 and 1581 (FIG. 44). These inter-board connectors provide the
inter-board connectivity for the various FPGA interconnects, such as N[73:0], S[73:0], W[73:0],
E[73:0], NH[27:0], SH[27:0], XH[36:0] and XH[72:37], excluding the local bus connections.

Furthermore, connectors 1531A and 1531B are the inter-board connectors for the local
bus. The local bus handles the signals between the PCI bus (via the PCI controller) and the
FPGA bus (via the FPGA I/O controller (CTRL_FPGA) unit). The local bus also handles
configuration and boundary scan test information between the PCI controller and the FPGA logic
devices and the FPGA I/O controller (CTRL_FPGA) unit.
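
The bit-range notation used above for the FPGA interconnects, such as N[73:0] or XH[72:37], gives the width of each bus as the high index minus the low index plus one. A small helper makes this explicit; the name `range_width` is hypothetical.

```python
import re


def range_width(signal: str) -> tuple[str, int]:
    """Return (bus name, number of bits) for a range such as 'N[73:0]'."""
    match = re.fullmatch(r"([A-Z]+)\[(\d+):(\d+)\]", signal.strip())
    if match is None:
        raise ValueError(f"not a bit-range expression: {signal!r}")
    name = match.group(1)
    high, low = int(match.group(2)), int(match.group(3))
    # Both endpoints are inclusive, hence the +1.
    return name, abs(high - low) + 1


if __name__ == "__main__":
    for sig in ("N[73:0]", "NH[27:0]", "XH[36:0]", "XH[72:37]"):
        print(sig, range_width(sig))
```

For S[73:0] this yields 74 bits, in agreement with the 74 bits noted for the south interconnection after Table D.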

In sum, the motherboard connector couples one board in a pair of boards to the PCI bus
and power. One set of connectors couples the FPGA interconnects via the component side of one
board to the solder side of the other board. Another set of connectors couples the local buses via
the component side of one board to the solder side of the other board.

In another embodiment of the present invention, more than two boards are used. Indeed,
FIG. 38(B) shows a six-board configuration. The configuration is analogous to that of FIG.
38(A), in which every other board is directly connected to the motherboard, and interconnects
and local buses of these boards are coupled together via inter-board connectors arranged
solder-side to component-side.

FIG. 38(B) shows six boards 1526 (first board), 1525 (second board), 1532 (third board),
1533 (fourth board), 1534 (fifth board), and 1535 (sixth board). These six boards are coupled to
the motherboard 1520 via the connectors on boards 1526 (first board), 1532 (third board), and
1534 (fifth board). The other boards 1525 (second board), 1533 (fourth board), and 1535 (sixth
board) are not directly coupled to the motherboard 1520; rather, they are indirectly coupled to the
motherboard through their respective connections to their respective neighbor boards.

Placed solder-side to component-side, the various inter-board connectors allow 
communication among the PCI bus components, the FPGA logic devices, memory devices, and 
various Simulation system control circuits. The first set of inter-board connectors 1990
corresponds to connectors J5 to J16 in FIG. 42. The second set of inter-board connectors 1991
corresponds to connectors J17 to J28 in FIG. 42. The third set of inter-board connectors 1992
corresponds to connectors J3 and J4 in FIG. 42.

Motherboard connectors 1521 to 1524 are provided on the motherboard 1520 to couple
the motherboard (and hence the PCI bus) to the six boards. As mentioned above, boards 1526
(first board), 1532 (third board), and 1534 (fifth board) are directly coupled to the connectors
1523, 1522, and 1521, respectively. The other boards 1525 (second board), 1533 (fourth board),
and 1535 (sixth board) are not directly coupled to the motherboard 1520. Because only one PCI
controller is needed for all six boards, only the first board 1526 contains a PCI controller. Also,
the motherboard connector 1523 which is coupled to the first board 1526 provides access to/from
the PCI bus. Connectors 1522 and 1521 are only coupled to power and ground. The
center-to-center spacing between adjacent motherboard connectors is approximately 20.32 mm in
one embodiment.

For the boards 1526 (first board), 1532 (third board), and 1534 (fifth board) that are
directly coupled to the motherboard connectors 1523, 1522, and 1521, respectively, the J5 to J16
connectors are located on the component side, the J17 to J28 connectors are located on the solder
side, and the J3 to J4 local bus connectors are located on the component side. For the other
boards 1525 (second board), 1533 (fourth board), and 1535 (sixth board) that are not directly
coupled to the motherboard connectors 1523, 1522, and 1521, the J5 to J16 connectors are
located on the solder side, the J17 to J28 connectors are located on the component side, and the
J3 to J4 local bus connectors are located on the solder side. For the end boards 1526 (first board)
and 1535 (sixth board), parts of the J17 to J28 connectors are 10-ohm R-pack terminations.
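
By way of illustration only, the six-board arrangement of FIG. 38(B) described above can be summarized in a short Python sketch. The names BOARDS, MOTHERBOARD_CONNECTOR, connector_sides, and describe are hypothetical and are not part of the described system; only the reference numerals and connector designations are taken from the text.

```python
# Hypothetical encoding of the six-board stack of FIG. 38(B).
# Reference numerals come from the text; the data layout is an assumption.
BOARDS = [1526, 1525, 1532, 1533, 1534, 1535]           # first .. sixth board
MOTHERBOARD_CONNECTOR = {1526: 1523, 1532: 1522, 1534: 1521}


def connector_sides(board):
    """Return the board side on which each connector group is located."""
    direct = board in MOTHERBOARD_CONNECTOR
    a, b = ("component", "solder") if direct else ("solder", "component")
    return {"J5-J16": a, "J17-J28": b, "J3-J4": a}


def describe(board):
    """Collect the properties the text gives for one board."""
    return {
        "position": BOARDS.index(board) + 1,
        "motherboard_connector": MOTHERBOARD_CONNECTOR.get(board),  # None if indirect
        "has_pci_controller": board == 1526,   # only the first board carries one
        "end_board": board in (BOARDS[0], BOARDS[-1]),  # R-pack terminations on J17-J28
        "sides": connector_sides(board),
    }


for board in BOARDS:
    print(board, describe(board))
```

In this sketch a board that has no entry in MOTHERBOARD_CONNECTOR is one of the indirectly coupled boards, so its connector sides are simply the mirror image of those of its directly coupled neighbor.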

FIGS. 40(A) and 40(B) show array connection across different boards. To facilitate the
manufacturing process, a single layout design is used for all the boards. As explained above,
boards connect to other boards through connectors without a backplane. FIG. 40(A) shows two
exemplary boards 1611 (board2) and 1610 (board1). The component side of board 1610 is facing
the solder side of board 1611. Board 1611 contains numerous FPGA logic devices, other
components, and wire lines. Particular nodes of these logic devices and other components on
board 1611 are represented by nodes A' (reference numeral 1612) and B' (reference numeral
1614). Node A' is coupled to connector pad 1616 via PCB trace 1620. Similarly, node B' is
connected to connector pad 1617 via PCB trace 1623.

Analogously, board 1610 also contains numerous FPGA logic devices, other components,
and wire lines. Particular nodes of these logic devices and other components on board 1610 are
represented by nodes A (reference numeral 1613) and B (reference numeral 1615). Node A is
coupled to connector pad 1618 via PCB trace 1625. Similarly, node B is connected to connector
pad 1619 via PCB trace 1622.

The routing of signals between nodes located in different boards using surface mount
connectors will now be discussed. In FIG. 40(A), the desired connections are between: (1) node
A and node B' as indicated by imaginary path 1623, 1624, and 1625, and (2) node B and node A'
as indicated by imaginary path 1620, 1621, and 1622. These connections are for paths such as
the asymmetric interconnect 1600 between board 1551 and board 1552 in FIG. 39. Other
asymmetric interconnects include the NH to SH interconnects 1977, 1979, and 1981 on both sides
of connectors 1589 and 1590.

A-A' and B-B' correspond to symmetrical interconnections like interconnect 1515 (N, S).
N and S interconnections use through hole connectors, whereas NH and SH asymmetric
interconnections use SMD connectors. Refer to Table D.

The actual implementation using surface mount connectors will now be discussed with
reference to FIG. 40(B) using like numbers for like items. In FIG. 40(B), board 1611 shows
node A' on the component side coupled to component-side connector pad 1636 via PCB trace
1620. The component-side connector pad 1636 is coupled to the solder-side connector pad 1639
via conductive path 1651. Solder-side connector pad 1639 is coupled to the component-side
connector pad 1642 on board 1610 via conductive path 1648. Finally, component-side connector
pad 1642 is coupled to node B via PCB trace 1622. Thus, node A' on board 1611 can be coupled
to node B on board 1610.

Likewise, in FIG. 40(B), board 1611 shows node B' on the component side coupled to
component-side connector pad 1638 via PCB trace 1623. The component-side connector pad
1638 is coupled to the solder-side connector pad 1637 via conductive path 1650. Solder-side
connector pad 1637 is coupled to the component-side connector pad 1640 via conductive path
1645. Finally, component-side connector pad 1640 is coupled to node A via PCB trace 1625.
Thus, node B' on board 1611 can be coupled to node A on board 1610. Because these boards
share the same layout, conductive paths 1652 and 1653 could be used in the same manner as
conductive paths 1650 and 1651 for other boards placed adjacent to board 1610. Thus, a unique
inter-board connectivity scheme is provided using surface mount and through hole connectors
without using switching components.
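
As an illustration only, the two cross-board routes of FIG. 40(B) described above can be written as a small link table and traced in Python. The dictionary LINKS and the function follow are hypothetical names introduced for this sketch; the pad, trace, and path numerals are those given in the text.

```python
# Hypothetical link table for FIG. 40(B): each point maps to
# (conductor used, next point). Numerals are taken from the text.
LINKS = {
    "node A' (board 1611)": ("PCB trace 1620", "pad 1636"),
    "pad 1636": ("conductive path 1651", "pad 1639"),
    "pad 1639": ("conductive path 1648", "pad 1642"),
    "pad 1642": ("PCB trace 1622", "node B (board 1610)"),
    "node B' (board 1611)": ("PCB trace 1623", "pad 1638"),
    "pad 1638": ("conductive path 1650", "pad 1637"),
    "pad 1637": ("conductive path 1645", "pad 1640"),
    "pad 1640": ("PCB trace 1625", "node A (board 1610)"),
}


def follow(start):
    """Walk the link table from a starting node and return the point reached."""
    point = start
    while point in LINKS:
        _conductor, point = LINKS[point]
    return point


print(follow("node A' (board 1611)"))   # node B (board 1610)
print(follow("node B' (board 1611)"))   # node A (board 1610)
```

The crossing of A' to B and B' to A falls out of the fixed pad wiring alone, which is consistent with the statement that no switching components are used.
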

F. TIMING-INSENSITIVE GLITCH-FREE LOGIC DEVICES

One embodiment of the present invention solves both the hold time and clock glitch
problems. During configuration of the user designs into the hardware model of the
reconfigurable computing system, standard logic devices (e.g., latches, flip-flops) found in the
user designs are replaced with emulation logic devices, or timing-insensitive glitch-free (TIGF)
logic devices, in accordance with one embodiment of the present invention. In one embodiment,
a trigger signal that has been incorporated into the ~EVAL signal is used to update the values
stored in these TIGF logic devices. After waiting for the various input and other signals to
propagate through the hardware model of the user design and reach steady-state during the
evaluation period, the trigger signal is provided to update the values stored or latched by the
TIGF logic devices. Thereafter, a new evaluation period begins. This evaluation period-trigger
period is cyclical, in one embodiment.

The hold time problem mentioned above will now be briefly discussed. As known to
those ordinarily skilled in the art, a common and pervasive problem in logic circuit design is hold
time violation. Hold time is defined as the minimum amount of time that the data input(s) of a
logic element must be held stable after the control input (e.g., clock input) changes to latch,
capture or store the value indicated by the data input(s); otherwise, the logic element will fail to
work properly.

A shift register example will now be discussed to illustrate the hold time requirement.
FIG. 75(A) shows an exemplary shift register in which three D-type flip-flops are connected
serially; that is, the output of flip-flop 2400 is coupled to the input of flip-flop 2401, whose
output is in turn coupled to the input of flip-flop 2402. The overall input signal Sin is coupled to
the input of flip-flop 2400 and the overall output signal Sout is generated from the output of
flip-flop 2402. All three flip-flops receive a common clock signal at their respective clock inputs.
This shift register design is based on the assumption that (1) the clock signal will reach all the
flip-flops at the same time, and (2) after detecting the edge of the clock signal, the input of the
flip-flop will not change for the duration of the hold time.

Referring to the timing diagram of FIG. 75(B), the hold time assumption is illustrated
where the system does not violate hold time requirements. The hold time varies from one logic
element to the next but is always specified in the specification sheets. The clock input changes
from logic 0 to logic 1 at time t0. As shown in FIG. 75(A), the clock input is provided to each
flip-flop 2400-2402. From this clock edge at t0, the input Sin must be stable for the duration of
the hold time TH, which lasts from time t0 to time t1. Similarly, the inputs to flip-flops 2401 (i.e.,
D2) and 2402 (i.e., D3) must also be stable for the duration of the hold time from the trigger edge
of the clock signal. Since this requirement is satisfied in FIGS. 75(A) and 75(B), input Sin is
shifted into flip-flop 2400, input at D2 (logic 0) is shifted into flip-flop 2401, and input at D3
(logic 1) is shifted into flip-flop 2402. As known to those ordinarily skilled in the art, after the
clock edge has been triggered, the new values at the input of flip-flop 2401 (logic 1 at input D2)
and flip-flop 2402 (logic 0 at input D3) will be shifted into or stored in the next flip-flop at the
next clock cycle assuming hold time requirements are satisfied. The table below summarizes the
operation of the shift register for these exemplary values:





                        D1      D2      D3      Q3

Before clock edge        1       0       1       0

After clock edge         1       1       0       1
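
The exemplary values in the table above can be reproduced with a minimal behavioral sketch, assuming an ideal shift register in which every flip-flop sees the clock edge at the same time and the hold time is satisfied. The function names clock_edge and row are hypothetical and serve only this illustration.

```python
def clock_edge(d1, state):
    """Advance the three-stage shift register of FIG. 75(A) by one clock edge.

    state is (Q1, Q2, Q3). Because Q1 feeds D2 and Q2 feeds D3, every stage
    captures the value its input held before the edge.
    """
    q1, q2, _q3 = state
    return (d1, q1, q2)


def row(d1, state):
    """Return the table columns (D1, D2, D3, Q3) for a given state."""
    q1, q2, q3 = state
    return (d1, q1, q2, q3)


before = (0, 1, 0)              # Q1=0 (so D2=0), Q2=1 (so D3=1), Q3=0
after = clock_edge(1, before)   # D1 = Sin = 1

print("Before clock edge", row(1, before))   # (1, 0, 1, 0)
print("After clock edge ", row(1, after))    # (1, 1, 0, 1)
```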



In an actual implementation, the clock signal will not reach all the logic elements at the
same time; rather, the circuit is designed such that the clock signal will reach all the logic elements
at almost the same time or substantially the same time. The circuit must be designed such that the
clock skew, or the timing difference between the clock signals reaching each flip-flop, is much
smaller than the hold time requirement. Accordingly, all the logic elements will capture the
appropriate input values. In the example above illustrated in FIGS. 75(A) and 75(B), hold time
violation due to clock signals arriving at different times at the flip-flops 2400-2402 may result in
some flip-flops capturing the old input values while another flip-flop captures a new input value.
As a result, the shift register will not operate properly.

In a reconfigurable logic (e.g., FPGA) implementation of the same shift register design, if
the clock is directly generated from a primary input, the circuit can be designed so that the low skew
network can distribute the clock signal to all the logic elements such that the logic elements will
detect the clock edge at substantially the same time. Primary clocks are generated from self-timed
test-bench processes. Usually, the primary clock signals are generated in software and only a few
(i.e., 1-10) primary clocks are found in a typical user circuit design.

However, if the clock signal is generated from internal logic instead of the primary input,
hold time becomes more of an issue. Derived or gated clocks are generated from a network of
combinational logic and registers that are in turn driven by the primary clocks. Many (i.e., 1,000 or
more) derived clocks are found in a typical user circuit design. Without extra precautions or
additional controls, these clock signals may reach each logic element at different times and the clock
skew may be longer than the hold time. This may result in the failure of a circuit design, such as the
shift register circuit illustrated in FIGS. 75(A) and 75(B).

Using the same shift register circuit illustrated in FIG. 75(A), hold time violation will now
be discussed. This time, however, the individual flip-flops of the shift register circuit are spread out
across multiple reconfigurable logic chips (e.g., multiple FPGA chips) as shown in FIG. 76(A). The
first FPGA chip 2411 contains the internally derived clock logic 2410 which will feed its clock
signal CLK to some components of FPGA chips 2412-2416. In this example, the internally
generated clock signal CLK will be provided to flip-flops 2400-2402 of the shift register circuit.
Chip 2412 contains flip-flop 2400, chip 2415 contains flip-flop 2401, and chip 2416 contains
flip-flop 2402. Two other chips 2413 and 2414 are provided to illustrate the hold time violation
concept.

The clock logic 2410 in chip 2411 receives a primary clock input (or possibly another
derived clock input) to generate an internal clock signal CLK. This internal clock signal CLK will
travel to chip 2412 and is labeled CLK1. The internal clock signal CLK from clock logic 2410 will
also travel to chip 2415 as CLK2 via chips 2413 and 2414. As shown, CLK1 is input to flip-flop
2400 and CLK2 is input to flip-flop 2401. Both CLK1 and CLK2 experience wire trace delays such
that the edges of CLK1 and CLK2 will be delayed from the edge of the internal clock signal CLK.
Furthermore, CLK2 will experience additional delays because it traveled through two other chips
2413 and 2414.

Referring to the timing diagram of FIG. 76(B), the internal clock signal CLK is generated
and triggered at time t2. Because of wire trace delays, CLK1 does not arrive at flip-flop 2400 in
chip 2412 until time t3, which is a delay of time T1. As shown in the table above, the output at Q1
(or input D2) is at logic 0 before the arrival of the clock edge of CLK1. After the edge of CLK1 is
sensed at flip-flop 2400, the input at D1 must remain stable for the requisite hold time H2 (i.e., until
time t4). At this point, flip-flop 2400 shifts in or stores the input logic 1 so that the output at Q1 (or
D2) is at logic 1.

While this is taking place for flip-flop 2400, the clock signal CLK2 is making its way to
flip-flop 2401 in chip 2415. The delay T2 caused by chips 2413 and 2414 was such that CLK2 arrived
at flip-flop 2401 at time t5. The input at D2 is now at logic 1 and after the hold time has been
satisfied for this flip-flop 2401, this logic value 1 will appear at the output Q2 (or D3). Thus, the
output Q2 was at logic 1 before the arrival of CLK2 and the output continues to be at logic 1 after
the arrival of CLK2. This is an incorrect result. This shift register should have shifted in logic 0.
While flip-flop 2400 correctly shifted in the old input value (logic 1), the flip-flop 2401 incorrectly
shifted in the new input value (logic 1). This incorrect operation typically results when the clock
skew (or timing delay) is greater than the hold time. In this example, T2>T1+H2. In sum, hold
time violations are likely to occur where the clock signal is generated in one chip and distributed
to logic elements that reside in different chips, as shown in FIG. 76(A), unless some
precautionary measures are taken.

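
The condition T2>T1+H2 stated above can be illustrated with a minimal Python sketch. It assumes, as a simplification, that the output Q1 of flip-flop 2400 takes its new value exactly T1+H2 after the edge of CLK, and that all delays are expressed in one arbitrary time unit. The function names hold_violation and shift_two_stages are hypothetical.

```python
def hold_violation(t1, t2, h2):
    """True when CLK2 arrives after the upstream output has already changed.

    t1: delay T1 from CLK to CLK1 at flip-flop 2400
    t2: delay T2 from CLK to CLK2 at flip-flop 2401
    h2: hold time H2 after which output Q1 (input D2) takes its new value
    """
    return t2 > t1 + h2


def shift_two_stages(d1, q1, t1, t2, h2):
    """Return (new Q1, new Q2) for one edge of the derived clock CLK."""
    new_q1 = d1
    # Without a violation, flip-flop 2401 stores the old Q1;
    # with a violation, it stores the value that Q1 has just acquired.
    new_q2 = new_q1 if hold_violation(t1, t2, h2) else q1
    return new_q1, new_q2


# Old Q1 (D2) is logic 0 and the new input is logic 1, as in the example.
print(shift_two_stages(1, 0, t1=2, t2=3, h2=2))   # (1, 0): correct shift
print(shift_two_stages(1, 0, t1=2, t2=6, h2=2))   # (1, 1): hold time violation
```
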

The clock glitch problem mentioned above will now be discussed with reference to FIGS.
77(A) and 77(B). Generally, when the inputs of a circuit change, the outputs change to some
random value for some very brief time before they settle down to the correct value. If another
circuit inspects the output at just the wrong time and reads the random value, the results can be
incorrect and difficult to debug. This random value that detrimentally affected another circuit is
called a glitch. In typical logic circuits, one circuit may generate the clock signal for another
circuit. If uncompensated timing delays exist in one or both circuits, a clock glitch (i.e., an
unplanned occurrence of a clock edge) may be generated which may cause an incorrect result.
Like hold time violation, clock glitches arise because certain logic elements in the circuit design
change values at different times.

FIG. 77(A) shows an exemplary logic circuit where some logic elements generate a clock
signal for another set of logic elements; that is, D-type flip-flop 2420, D-type flip-flop 2421, and
exclusive-or (XOR) gate 2422 generate a clock signal (CLK3) for D-type flip-flop 2423. Flip-flop
2420 receives its data input at D1 on line 2425 and outputs data at Q1 on line 2427. It receives its
clock input (CLK1) from clock logic 2424. CLK refers to the originally generated clock signal
from the clock logic 2424 and CLK1 refers to the same signal that is delayed in time when it
reaches flip-flop 2420.

Flip-flop 2421 receives its data input at D2 on line 2426 and outputs data at Q2 on line 2428.
It receives its clock input (CLK2) from clock logic 2424. As mentioned above, CLK refers to the
originally generated clock signal from the clock logic 2424 and CLK2 refers to the same signal that
is delayed in time when it reaches flip-flop 2421.

The outputs from flip-flops 2420 and 2421 on lines 2427 and 2428, respectively, are inputs
to XOR gate 2422. XOR gate 2422 outputs data labeled as CLK3 to the clock input of flip-flop
2423. Flip-flop 2423 also inputs data at D3 on line 2429 and outputs data at Q3.

The clock glitch problem that may arise for this circuit will now be discussed with reference
to the timing diagram illustrated in FIG. 77(B). The CLK signal is triggered at time t0. By the time
this clock signal (i.e., CLK1) reaches flip-flop 2420, it is already time t1. CLK2 does not reach
flip-flop 2421 until time t2.

Assume that the inputs to D1 and D2 are both at logic 1. When CLK1 reaches flip-flop 2420
at time t1, the output at Q1 will be at logic 1 (as shown in FIG. 77(B)). CLK2 arrives at flip-flop
2421 a little late at time t2 and thus, the output Q2 on line 2428 remains at logic 0 from time t1 to
time t2. The XOR gate 2422 generates a logic 1 as CLK3 for presentation to the clock input of
flip-flop 2423 during the time period between time t1 and time t2, even though the desired signal is a
logic 0 (1 XOR 1 = 0). This generation of CLK3 during this time period between time t1 and time t2
is a clock glitch. Accordingly, whatever logic value is present at D3 on input line 2429 of flip-flop
2423 is stored whether this is desired or not, and this flip-flop 2423 is now ready for the next input
on line 2429. If properly designed, the time delay of CLK1 and CLK2 would be minimized such
that no clock glitch would be generated, or at the very least, the clock glitch would last for such a
short duration that it would not impact the rest of the circuit. In the latter case, if the clock skew
between CLK1 and CLK2 is short enough, the XOR gate delay will be long enough to filter out the
glitch, which then would not impact the rest of the circuit.
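
The glitch mechanism described above can be sketched in Python under simplifying assumptions: time is sampled at integer instants in an arbitrary unit, both flip-flop outputs start at logic 0, and the XOR gate is treated as having no delay (so no filtering occurs). The function names q_at and clk3_waveform are hypothetical.

```python
def q_at(t, arrival, d, initial=0):
    """Output of a D flip-flop whose clock edge arrives at time `arrival`."""
    return d if t >= arrival else initial


def clk3_waveform(t1, t2, d1=1, d2=1):
    """Sample CLK3 = Q1 XOR Q2 at integer times 0 .. max(t1, t2) + 1."""
    t_end = max(t1, t2) + 2
    return [q_at(t, t1, d1) ^ q_at(t, t2, d2) for t in range(t_end)]


# CLK1 arrives at t1 = 1 and CLK2 arrives late at t2 = 3.
print(clk3_waveform(1, 3))   # [0, 1, 1, 0, 0]: CLK3 is high between t1 and t2
# With no skew between CLK1 and CLK2, CLK3 stays at logic 0.
print(clk3_waveform(1, 1))   # [0, 0, 0]
```

The interval in which the first waveform is at logic 1 corresponds to the unplanned clock edge presented to flip-flop 2423.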

Two known solutions to the hold time violation problem are (1) timing adjustment, and (2)
timing resynthesis. Timing adjustment, discussed in U.S. Patent No. 5,475,830, requires the
insertion of sufficient delay elements (such as buffers) in certain signal paths to prolong the hold
time of the logic elements. For example, adding sufficient delay on the inputs D2 and D3 in the shift
register circuit above may avoid hold time violation. Thus, in FIG. 78, the same shift register
circuit is shown with delay elements 2430 and 2431 added to the inputs D2 and D3, respectively. As
a result, the delay element 2430 can be designed such that time t4 occurs after time t5 so that
T2<T1+H2 (FIG. 76(B)), and hence, no hold time violation will occur.

A potential problem with the timing adjustment solution is that it relies on the specification
sheet of the FPGA chips too heavily. As known to those skilled in the art, reconfigurable logic
chips, like FPGA chips, implement logic elements with look-up tables. The delay of look-up tables
in the chips is provided in the specification sheets and designers using the timing adjustment
method of avoiding hold time violations rely on this specified time delay. However, this delay is
just an estimate and varies from chip to chip. Another potential problem with the timing adjustment
method is that designers must also compensate for the wiring delays present throughout the circuit
design. Although this is not an impossible task, the estimation of wiring delay is time-consuming
and prone to errors. Moreover, the timing adjustment method does not solve clock glitch problems.

Another solution is timing resynthesis, a technique introduced by IKOS's VirtualWires
technology. The timing resynthesis concept involves transforming a user's circuit design into a
functionally equivalent design while strictly controlling the timing of clock and pin-out signals via
finite state machines and registers. Timing resynthesis retimes a user's circuit design by introducing
a single high speed clock. It also converts latches, gated clocks, and multiple synchronous and
asynchronous clocks into a flip-flop based single-clock synchronous design. Thus, timing
resynthesis uses registers at the input and output pin-outs of each chip to control the precise
inter-chip signal movement so that no inter-chip hold-time violation will occur. Timing resynthesis also
uses a finite state machine in each chip to schedule inputs from other chips, schedule outputs to
other chips, and schedule updates of internal flip-flops based on the reference clock.




Using the same shift register circuit introduced in the discussion above associated with 
FIGS. 75(A), 75(B), 76(A), and 76(B), FIG. 79 shows one example of the timing resynthesis circuit. 
The basic three flip-flop shift register design has been transformed into a functionally equivalent 
circuit. Chip 2430 includes the original internal clock generating logic 2435 coupled to a register 
2443 via line 2448. The clock logic 2435 generates the CLK signal. A first finite state machine
2438 is also coupled to the register 2443 via line 2449. Both the register 2443 and the first finite 
state machine 2438 are controlled by a design-independent global reference clock. 

The CLK signal is also delivered across chips 2432 and 2433 before it arrives at chip 2434.
In chip 2432, a second finite state machine 2440 controls a register 2445 via line 2462. The CLK
signal travels to register 2445 via line 2461 from register 2443. Register 2445 outputs the CLK
signal to the next chip 2433 via line 2463. Chip 2433 includes a third finite state machine 2441
which controls a register 2446 via line 2464. The register 2446 outputs the CLK signal to chip
2434.

Chip 2431 includes the original flip-flop 2436. A register 2444 receives the input Sin and
outputs the input Sin to the input of flip-flop 2436 via line 2452. The Q1 output of the flip-flop
2436 is coupled to register 2466 via line 2454. A fourth finite state machine 2439 controls the
register 2444 via line 2451, register 2466 via line 2455, and the flip-flop 2436 via the latch enable
line 2453. The fourth finite state machine 2439 also receives the original clock signal CLK from
chip 2430 via line 2450.

Chip 2434 includes the original flip-flop 2437, which receives the signal from register 2466
in the chip 2431 at its D2 input via line 2456. The Q2 output of the flip-flop 2437 is coupled to
register 2447 via line 2457. A fifth finite state machine 2442 controls the register 2447 via line
2459, and the flip-flop 2437 via the latch enable line 2458. The fifth finite state machine 2442 also
receives the original clock signal CLK from chip 2430 via chips 2432 and 2433.

With timing resynthesis, the finite state machines 2438-2442, registers 2443-2447 and 2466,
and the single global reference clock are used to control signal flow across multiple chips and
update internal flip-flops. Thus, in chip 2430, the distribution of the CLK signal to other chips is
scheduled by the first finite state machine 2438 via the register 2443. Similarly, in chip 2431, the
fourth finite state machine 2439 schedules the delivery of the input Sin to the flip-flop 2436 via
register 2444 as well as the output Q1 via register 2466. The latching function of the flip-flop 2436
is also controlled by a latch enable signal from the fourth finite state machine 2439. The same
principle holds for the logic in the other chips 2432-2434. With such tight control of inter-chip
input delivery schedule, inter-chip output delivery schedule, and internal flip-flop state updating,
inter-chip hold-time violations are eliminated.

However, the timing resynthesis technique requires the transformation of the user's circuit
design into a much larger functionally equivalent circuit including the addition of finite state
machines and registers. Typically, the additional logic necessary to implement this technique takes
up to 20% of the useful logic in each chip. Furthermore, this technique is not immune to clock
glitch problems. To avoid clock glitches, designers using the timing resynthesis technique must
take additional precautionary steps. One conservative design approach is to design the circuit so
that the inputs to a logic device utilizing gated clocks are not changed at the same time. An
aggressive approach uses the gate delays to filter the glitches so that they do not impact the rest of
the circuit. However, as stated above, timing resynthesis requires some additional non-trivial
measures to avoid clock glitches.

The various embodiments of the present invention, which solve both the hold time and
clock glitch problems, will now be discussed. During configuration mapping of the user design
into the software model of the RCC computing system and the hardware model of the RCC array,
latches shown in FIG. 18(A) are emulated with a timing insensitive glitch-free (TIGF) latch in
accordance with one embodiment of the present invention. Similarly, design flip-flops shown in
FIG. 18(B) are emulated with a TIGF flip-flop in accordance with one embodiment of the present
invention. These TIGF logic devices, whether in the form of a latch or flip-flop, can also be
called emulation logic devices. The updates of the TIGF latches and flip-flops are controlled with
a global trigger signal.

In one embodiment of the present invention, not all of the logic devices found in the user
design circuit are replaced with the TIGF logic devices. A user design circuit includes those
portions that are enabled or clocked by the primary clocks and other portions that are controlled
by gated or derived clocks. Because hold time violations and clock glitches are issues for the
latter case where logic devices are controlled by gated or derived clocks, only these particular
logic devices that are controlled by gated or derived clocks are replaced with the TIGF logic
devices in accordance with the present invention. In other embodiments, all logic devices found
in the user design circuit are replaced with the TIGF logic devices.

Before discussing the TIGF latch and flip-flop embodiments of the present invention, the
global trigger signal will be discussed. Generally, the global trigger signal is used to allow the
TIGF latches and flip-flops to keep their state (i.e., keep the old input value) during the evaluation
period and update their state (i.e., store the new input value) during a short trigger period. In one
embodiment, the global trigger signal, shown in FIG. 82, is separate from and derived from the
~EVAL signal discussed above. In this embodiment, the global trigger signal has a long
evaluation period followed by a short trigger period. The global trigger signal tracks the ~EVAL
signal during the evaluation period and at the conclusion of the EVAL cycle, a short trigger
signal is generated to update the TIGF latches and flip-flops. In another embodiment, the
~EVAL signal is the global trigger signal, where the ~EVAL signal is at one logic state (e.g.,
logic 0) during the evaluation period and at another logic state (e.g., logic 1) during
non-evaluation or TIGF latch/flip-flop update periods.

The evaluation period, as discussed above with respect to the RCC computing system and
RCC hardware array, is used to propagate all the primary inputs and flip-flop/latch device
changes into the entire user design, one simulation cycle at a time. During the propagation, the
RCC system waits until all the signals in the system achieve steady-state. The evaluation period
is calculated after the user design has been mapped and placed into the appropriate reconfigurable
logic devices (e.g., FPGA chips) of the RCC array. Accordingly, the evaluation period is
design-specific; that is, the evaluation period for one user design may be different from the
evaluation period for another user design. This evaluation period must be long enough to assure
that all the signals in the system are propagated through the entire system and reach steady-state
before the next short trigger period.

The short trigger period occurs adjacent in time to the evaluation period, as shown in
FIG. 82. In one embodiment, the short trigger period occurs after the evaluation period. Prior
to this short trigger period, the input signals are propagated throughout the hardware
model-configured portion of the user design circuit during the evaluation period. The short trigger
period, marked by a change in the logic state of the ~EVAL signal in accordance with one
embodiment of the present invention, controls all the TIGF latches and flip-flops in the user
design so that they can be updated with the new values that have been propagated from the
evaluation period after steady-state has been achieved. This short trigger period is globally
distributed with a low skew network and can be as short (i.e., duration from t0 to t1, as well as
duration t2 to t3, as shown in FIG. 82) as the reconfigurable logic devices will allow for proper
operation. During this short trigger period, the new primary inputs are sampled at every input
stage of the TIGF latches and flip-flops and the old stored values at the same TIGF latches and
flip-flops are exported out to the next stage in the RCC hardware model of the user design. In
the discussion below, the portion of the global trigger signal that occurs during the short trigger
period will be referred to as the TIGF trigger, TIGF trigger signal, trigger signal, or simply the
trigger.

FIG. 80(A) shows the latch 2470 originally shown in FIG. 18(A). This latch operates as
follows:

    if (#S), Q ← 1
    else if (#R), Q ← 0
    else if (en), Q ← D
    else Q keeps the old value.

Because this latch is level-sensitive and asynchronous, so long as the clock input is enabled and 
the latch enable input is enabled, the output Q tracks the input D. 

FIG. 80(B) shows the TIGF latch in accordance with one embodiment of the present
invention. Like the latch of FIG. 80(A), the TIGF latch has a D input, an enable input, a set (S),
a reset (R), and an output Q. Additionally, it has a trigger input. The TIGF latch includes a D
flip-flop 2471, a multiplexer 2472, an OR gate 2473, an AND gate 2474, and various
interconnections.

D flip-flop 2471 receives its input from the output of AND gate 2474 via line 2476. The 
D flip-flop is also triggered at its clock input by a trigger signal on line 2477, which is globally
distributed by the RCC system in accordance with a strict schedule dependent on the evaluation
cycle. The output of D flip-flop 2471 is coupled to one input of multiplexer 2472 via line 2478.
The other input of multiplexer 2472 is coupled to the TIGF latch D input on line 2475. The
multiplexer is controlled by an enable signal on line 2484. The output of the multiplexer 2472 is
coupled to one input of OR gate 2473 via line 2479. The other input of OR gate 2473 is coupled
to the set (S) input on line 2480. The output of the OR gate 2473 is coupled to one input of AND
gate 2474 via line 2481. The other input of AND gate 2474 is coupled to the reset (R) signal on
line 2482. The output of AND gate 2474 is fed back to the input of the D flip-flop 2471 via line
2476, as mentioned above.

The operation of this TIGF latch embodiment of the present invention will now be
discussed. In this embodiment of the TIGF latch, the D flip-flop 2471 holds the current state
(i.e., old value) of the TIGF latch. Line 2476 at the input of D flip-flop 2471 presents the new
input value that has yet to be latched into the TIGF latch. Line 2476 presents the new value
because the main input (D input) of the TIGF latch on line 2475 ultimately makes its way from
the input of the multiplexer 2472 (with the proper enable signal on line 2484, which will
ultimately be presented) through the OR gate 2473, and finally through the AND gate 2474 onto
line 2483, which feeds back the new input signal of the TIGF latch to the D flip-flop 2471 on line
2476. A trigger signal on line 2477 updates the TIGF latch by clocking the new input value on
line 2476 into the D flip-flop 2471. Thus, the output on line 2478 of the D flip-flop 2471
indicates the current state (i.e., old value) of the TIGF latch, while the input on line 2476
indicates the new input value that has yet to be latched by the TIGF latch.

The multiplexer 2472 receives the current state from D flip-flop 2471 as well as the new 
input value on line 2475. The enable line 2484 functions as the selector signal for the multiplexer 
2472. Because the TIGF latch will not update (i.e., store new input value) until the trigger signal
is provided on line 2477, the D input of the TIGF latch on line 2475 and the enable input on line
2484 can arrive at the TIGF latch in any order. If this TIGF latch (and other TIGF latches in the
hardware model of the user design) encounters a situation that would normally cause a hold time
violation in a circuit that used a conventional latch, such as in the discussion above with respect
to FIGS. 76(A) and 76(B) where one clock signal arrived much later than another clock signal,
this TIGF latch will function properly by keeping the proper old value until the trigger signal is
provided on line 2477.

The trigger signal is distributed through the low-skew global clock network.

This TIGF latch also solves the clock glitch problem. Note that the clock signal is
replaced by the enable signal in the TIGF latch. The enable signal on line 2484 can glitch often
during the evaluation period but the TIGF latch will continue to hold the current state without
fail. The only mechanism by which the TIGF latch can be updated is through the trigger signal,
which is provided after the evaluation period, in one embodiment, when the signals have attained
steady-state.
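
The glitch behavior described above can be illustrated with a small behavioral model. The following Python sketch is not part of the disclosed hardware; the class and function names are hypothetical, and it assumes that an enable of 1 selects the D input at multiplexer 2472 and that the reset line 2482 carries logic 1 while reset is inactive (since it feeds an AND gate).

```python
class TigfLatch:
    """Behavioral sketch of the TIGF latch of FIG. 80(B) (names are illustrative)."""

    def __init__(self, q=0):
        self.q = q  # state held by D flip-flop 2471, visible on line 2478

    def next_value(self, d, en, s=0, r_line=1):
        # Multiplexer 2472: assumed to pass D when enable is 1, else the stored state.
        mux = d if en else self.q
        # OR gate 2473 (set), then AND gate 2474 (reset line, assumed 1 when inactive).
        return (mux | s) & r_line  # value presented on line 2476

    def trigger(self, d, en, s=0, r_line=1):
        # Trigger on line 2477 clocks the value on line 2476 into flip-flop 2471.
        self.q = self.next_value(d, en, s, r_line)
        return self.q


def conventional_latch(q, d, en):
    """Level-sensitive latch: Q follows D whenever enable is high."""
    return d if en else q


def glitch_demo():
    # (d, en) samples during one evaluation period: enable glitches high, then settles low.
    samples = [(1, 0), (1, 1), (1, 0)]
    q_conv = 0
    for d, en in samples:
        q_conv = conventional_latch(q_conv, d, en)  # reacts to every sample
    tigf = TigfLatch(q=0)
    tigf.trigger(*samples[-1])  # updates once, from the steady-state values only
    return q_conv, tigf.q
```

In this sketch the conventional latch captures the glitch and ends at 1, while the TIGF model, which only looks at the settled inputs when the trigger arrives, keeps its old value of 0.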

FIG. 81(A) shows a flip-flop 2490 originally shown in FIG. 18(B). This flip-flop
operates as follows:

    if (#S), Q ← 1
    else if (#R), Q ← 0
    else if (positive edge of CLK), Q ← D
    else Q keeps the old value.

Because this flip-flop is edge-triggered, so long as the flip-flop enable input is enabled, the
output Q takes the value of the input D at the positive edge of the clock signal.

FIG. 81(B) shows the TIGF D-type flip-flop in accordance with one embodiment of the
present invention. Like the flip-flop of FIG. 81(A), the TIGF flip-flop has a D input, a clock
input, a set (S), a reset (R), and an output Q. Additionally, it has a trigger input. The TIGF
flip-flop includes three D flip-flops 2491, 2492, and 2496, a multiplexer 2493, an OR gate 2494, two
AND gates 2495 and 2497, and various interconnections.

Flip-flop 2491 receives the TIGF D input on line 2498, the trigger input on line 2499, and
provides a Q output on line 2500. This output line 2500 also serves as one of the inputs to
multiplexer 2493. The other input to the multiplexer 2493 comes from the Q output of flip-flop
2492 via line 2503. The output of multiplexer 2493 is coupled to one of the inputs of OR gate
2494 via line 2505. The other input of OR gate 2494 is the set (S) signal on line 2506. The
output of OR gate 2494 is coupled to one of the inputs of AND gate 2495 via line 2507. The
other input of AND gate 2495 is the reset (R) signal on line 2508. The output of AND gate 2495
(which is also the overall TIGF output Q) is coupled to the input of flip-flop 2492 via line 2501.
Flip-flop 2492 also has a trigger input on line 2502.

Returning to the multiplexer 2493, its selector input is coupled to the output of AND gate
2497 via line 2509. AND gate 2497 receives one of its inputs from the CLK signal on line 2510
and the other input from the output of flip-flop 2496 via line 2512. Flip-flop 2496 also receives
its input from the CLK signal on line 2511 and its trigger input on line 2513.

The operation of the TIGF flip-flop embodiment of the present invention will now be
discussed. In this embodiment, the TIGF flip-flop receives the trigger signal at three different
points - the D flip-flop 2491 via line 2499, the D flip-flop 2492 via line 2502, and the D flip-flop
2496 via line 2513.

The TIGF flip-flop stores the input value only when an edge of the clock signal has been
detected. In accordance with one embodiment of the present invention, the required edge is the
positive edge of the clock signal. To detect this positive edge of the clock signal, an edge
detector 2515 has been provided. The edge detector 2515 includes a D flip-flop 2496 and an
AND gate 2497. The edge detector 2515 is also updated via the trigger signal on line 2513 of the
D flip-flop 2496.

The D flip-flop 2491 holds the new input value of the TIGF flip-flop and resists any
changes to the D input on line 2498 until the trigger signal is provided on line 2499. Thus,
before each evaluation period of the TIGF flip-flop, the new value is stored in the D flip-flop
2491. Accordingly, the TIGF flip-flop avoids hold time violations by pre-storing the new value
until the TIGF flip-flop is updated by the trigger signal.

D flip-flop 2492 holds the current value (or old value) of the TIGF flip-flop until the
trigger signal is provided on line 2502. This value is the state of the emulated TIGF flip-flop
after it has been updated and before the next evaluation period. The input to the D flip-flop 2492
on line 2501 holds the new value (which is the same value on line 2500, for a significant duration
of the evaluation period).

The multiplexer 2493 receives the new input value on line 2500 and the old value that is
currently stored in the TIGF flip-flop on line 2503. Based on the selector signal on line 2504, the
multiplexer outputs either the new value (line 2500) or the old value (line 2503) as the output of
the emulated TIGF flip-flop. This output changes with any clock glitches before all of the
propagated signals in the user design's hardware model approach steady-state. Thus, the input on
line 2501 will present the new value that is stored in flip-flop 2491 by the end of the evaluation
period. When the trigger signal is received by the TIGF flip-flop, flip-flop 2492 now stores the
new value that was present in line 2501 and the flip-flop 2491 stores the next new value on line
2498. Thus, the TIGF flip-flop in accordance with one embodiment of the present invention is
not negatively affected by clock glitches.

To further elaborate, this TIGF flip-flop also provides some immunity against clock
glitches. One ordinarily skilled in the art will realize that by replacing the flip-flops 2420, 2421,
and 2423 in FIG. 77(A) with the TIGF flip-flop embodiment of FIG. 81(B), clock glitches will
not impact any circuit utilizing this TIGF flip-flop. Referring to FIGS. 77(A) and 77(B) for a
moment, a clock glitch negatively impacted the circuit of FIG. 77(A) because for the time
between time t1 and t2, the flip-flop 2423 clocked in a new value when it should not have clocked
in a new value. The skewed nature of the CLK1 and CLK2 signals forced XOR gate 2422 to
generate a logic 1 state during the time period between time t1 and t2, which drove the clock line
of the next flip-flop 2423. With the TIGF flip-flop in accordance with one embodiment of the
present invention, the clock glitches will not affect its clocking in of the new value. Substituting
the flip-flop 2423 with the TIGF flip-flop, once the signals have achieved steady-state during the
evaluation period, the trigger signal during the short trigger period will enable the TIGF flip-flop
to store the new value in flip-flop 2491 (FIG. 81(B)). Thereafter, any clock glitches, like the
clock glitch of FIG. 77(B) during the time interval from time t1 to t2, will not clock in a new
value. The TIGF flip-flop updates only with the trigger signal and this trigger signal will not be
presented to the TIGF flip-flop until after the evaluation period when the signals propagating
through the circuit have achieved steady-state.
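
The interplay of the three internal flip-flops can be made concrete with a behavioral sketch. The Python model below is illustrative only; its names are hypothetical, and it assumes that AND gate 2497 sees the inverted output of flip-flop 2496 (so that the selector is 1 only on a 0-to-1 clock transition since the last trigger) and that the reset line 2508 carries logic 1 while reset is inactive.

```python
class TigfFlipFlop:
    """Behavioral sketch of the TIGF D flip-flop of FIG. 81(B) (names are illustrative)."""

    def __init__(self, q=0):
        self.new = q       # flip-flop 2491: pre-stored new value (line 2500)
        self.old = q       # flip-flop 2492: current state (line 2503)
        self.clk_prev = 0  # flip-flop 2496: clock level seen at the last trigger

    def output(self, clk, s=0, r_line=1):
        # Edge detector 2515: assumed to be CLK AND NOT(previous CLK).
        edge = clk & (1 - self.clk_prev)
        # Multiplexer 2493: new value on a detected positive edge, else old value.
        mux = self.new if edge else self.old
        # OR gate 2494 (set), then AND gate 2495 (reset line, assumed 1 when inactive).
        return (mux | s) & r_line  # TIGF output Q, also line 2501

    def trigger(self, d, clk, s=0, r_line=1):
        # All three flip-flops update together on the trigger signal.
        q = self.output(clk, s, r_line)
        self.old, self.new, self.clk_prev = q, d, clk
        return q


def glitch_then_edge():
    ff = TigfFlipFlop()
    ff.trigger(d=1, clk=0)                        # pre-store the new value 1
    during_glitch = [ff.output(c) for c in (0, 1, 0)]
    after_glitch = ff.trigger(d=1, clk=0)         # clock settled low: state unchanged
    after_edge = ff.trigger(d=0, clk=1)           # clock settled high: new value taken
    return during_glitch, after_glitch, after_edge
```

Under these assumptions the output follows the glitch transiently during the evaluation period, but the stored state changes only if the clock is still high when the trigger arrives.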

Although this particular embodiment of the TIGF flip-flop is a D-type flip-flop, other flip- 
flops (e.g., T, JK, SR) are within the scope of the present invention. Other types of edge-triggered 
flip-flops can be derived from the D flip-flop by adding some AND/OR logic before the D input. 
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G. DYNAMIC LOGIC EVALUATION 

One embodiment of the present invention provides a dynamic logic evaluation system and 
method which dynamically calculates the evaluation time necessary for each input. In contrast, 
5 the prior art systems provide for a fixed and statically calculated evaluation time that is primarily 
based on the worst possible evaluation time caused by the worst possible circuit/trace length path. 
Thus, this embodiment of the preset invention will remove the performance burden that a fixed 
and statically calculated evaluation tune would introduce. This dynamic logic evaluation system 
and method will not penalize 99% of the inputs for the sake of the 1 % of the inputs that need the 
10 worst possible evaluation time. By dynamically calculating different evaluation times based on 
the input, the overall evaluation time is shortened by 10 to 100 times compared to the current 
statically calculated constant evaluation time techniques. In addition, the static loop problem will 
be a non-issue. 
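
To see where a speedup of this order can come from, consider a purely hypothetical workload. In the Python sketch below, the cycle counts (99 inputs needing 10 cycles each and one input needing 1000 cycles) are invented for illustration and do not come from any measurement; the function names are likewise assumptions.

```python
def total_cycles_static(eval_times):
    """Every input is charged the worst-case evaluation time."""
    return max(eval_times) * len(eval_times)


def total_cycles_dynamic(eval_times):
    """Each input is charged only the evaluation time it actually needs."""
    return sum(eval_times)


# Hypothetical workload: 99% of inputs settle quickly, 1% need the worst case.
eval_times = [10] * 99 + [1000]

static_total = total_cycles_static(eval_times)    # 100 * 1000
dynamic_total = total_cycles_dynamic(eval_times)  # 99 * 10 + 1000
speedup = static_total / dynamic_total
```

With these made-up numbers the static scheme spends 100,000 cycles and the dynamic scheme 1,990, a ratio of roughly 50, which falls inside the 10 to 100 times range stated above.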

A system diagram is provided in FIG. 90. In this exemplary diagram, four FPGA chips
2710-2713 are shown. However, any number of FPGA chips and boards can be provided while
still incorporating the dynamic logic evaluation system in accordance with one embodiment of the
present invention. As discussed throughout this patent specification, the FPGA chips collectively
contain the hardware model of the user's circuit design. Because the hardware model of the
user's circuit design is spread across multiple FPGA chips, the input can propagate from one
FPGA chip to another. For example, FPGA chip 2710 accepts some input and the result of
processing that input becomes a2 and d1, as illustrated in FIG. 90. Data a2 makes its way to
FPGA chip 2711, while data d1 is delivered to FPGA chip 2713. Similarly, data d2 in FPGA
chip 2713 is delivered to FPGA chip 2710 and data c1 is delivered to FPGA chip 2712. The
dynamic logic evaluation system keeps track of these propagating data in dynamically determining
the evaluation time.

The evaluation time must be designed to be long enough to allow any given input to be
evaluated properly until the corresponding output stabilizes. So, if the input is processed and the
changing data (if any) propagates through the FPGA chips, the dynamic logic evaluation system
recognizes that the output has not stabilized yet. Accordingly, no new input must be processed at
this point. In time though, the output will stabilize for a given input. Once the output has
stabilized, the dynamic logic evaluation system will then instruct the next input to be processed.

In accordance with one embodiment of the present invention, the dynamic logic evaluation
system and method comprises a global control unit 2700 which is controlled by a master clock.
This global control unit 2700 is coupled to several FPGA chips 2710-2713 in general and
propagation detectors 2704-2707 in particular. In each FPGA chip, a propagation detector is
provided. So, FPGA chip 2710 contains propagation detector 2704, FPGA chip 2711 contains
propagation detector 2705, FPGA chip 2712 contains propagation detector 2706, and FPGA chip
2713 contains propagation detector 2707.

The propagation detector in each FPGA chip alerts the global control unit 2700 of any
input data that is currently propagating within the FPGA chips, which implies that the output has
not stabilized yet. Specifically, the propagation detector in each FPGA chip detects inter-chip
propagation of data; that is, the propagation detector detects data that are in the process of
moving from one chip to another. The propagation detector does not care about data that are
propagating or otherwise changing within a chip if these same data are not moving across chips.
Thus, data a1 in chip 2711 needs to propagate to chip 2710, so the propagation detector 2705 will
detect this propagation. Similarly, data b2 in chip 2711 is to propagate to chip 2712,
so the propagation detector 2705 will detect this propagation. Other data that are changing in chip
2711 will not be monitored if these changing data are not moving to another chip.

As long as the relevant input data is propagating, the global control unit 2700 will prevent
the next input from being provided to the FPGA chips for evaluation. The global control unit
2700 uses the next input signal on line 2703 for this purpose. In effect, so long as the output has
not stabilized with the given input, the next set of inputs will not be processed. Once the output
has stabilized, the global control unit 2700 will then instruct the system to accept and process the
next set of input data with the next input signal on line 2703.

Thus, the global control unit 2700 in conjunction with the propagation detectors can 
dynamically provide varying evaluation time periods based on the needs of the input data. 
Whether the system needs longer or shorter evaluation times, the system will dynamically adjust
the amount of evaluation time necessary to properly process that input and then move on to the
next evaluation time for the next set of inputs. The sooner the signals stabilize, the faster the logic
evaluation process. For the 1% case where the input requires the worst possible evaluation time,
the global control unit 2700 will delay the expiration of the evaluation time until the output has
stabilized.

How does the global control unit 2700 know how long to extend the evaluation time? The
global control unit 2700 uses a global propagation delay register (PDR) 2701 and a global
propagation delay counter (PDC) 2702. The PDR 2701 contains the value of a particular number
of cycles. In one embodiment, this value is 10 cycles. This value can range anywhere
from 1 to 10, although values beyond 10 are also possible. The value in the PDR 2701 is
the maximum delay in sending data from one FPGA chip to another. It is not necessarily the
worst possible evaluation time.

The PDC 2702 is a down counter. The PDC 2702 counts down at every master clock
cycle from whatever value is in the counter. The PDC 2702 normally gets the counter value
from the PDR 2701. When the down counter PDC 2702 reaches 0, the next input signal on line
2703 is triggered. So, if the PDR 2701 contained the value 5 and the PDC 2702 is instructed to
load the PDR value, then the down counter PDC 2702 counts down from 5 at every master clock
cycle. In 5 cycles, the down counter PDC 2702 reaches 0 and the global control unit 2700 sends
the next input signal on line 2703 to instruct the system to process the next input. Note that the
value in the PDR 2701 does not determine the length of the evaluation time; rather, the
propagation detection logic determines the evaluation time. PDR 2701 provides the extra delay
control needed after detecting the last propagation activity from any given FPGA chip and
ensures that the propagation activity reaches its connected FPGAs.

The PDR 2701 holds a value that represents the maximum delay (in number of master
clock cycles) that is needed for a signal to propagate between two FPGA chips. Usually, these
chips are neighboring chips and are directly connected to each other. Depending on the
interconnect technology, this PDR value can be as small as 1 and as large as 10. Typically, this
number is less than 10 for most systems. The PDC down counter 2702 is loaded with the value
of the PDR at the start of each evaluation cycle or when the global propagation signal on line
2714 asserts (as described further below).

In one embodiment, the interconnect technology uses multiplexers at the boundaries of
each chip to save pin-outs. Thus, each FPGA chip uses an N-to-1 mux to transport the data from
that chip to another chip. Time-division multiplexing techniques are used to ensure that all the
relevant data makes its way to the other chips via this mux. This multiplexing technique is
described elsewhere in this patent specification. Thus, if a 5-to-1 mux is used to deliver the data
from chip 2713 to chip 2712, the PDR 2701 holds a value of 5 so that each of the five inputs to
the 5-to-1 mux is transported to the other chip at each cycle. Until all of the data at the input of
this 5-to-1 mux has been transported to the next chip, the dynamic logic evaluation system will
prevent the next input from being processed. In another embodiment, event detection techniques
are used, not time-division multiplexing.
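
The relationship between the mux width and the PDR value can be sketched as follows. This Python fragment is only an illustration under the assumption that the N-to-1 mux forwards exactly one of its inputs per master clock cycle in a fixed slot order; the function name and the sample signal values are hypothetical.

```python
def tdm_transport(mux_inputs):
    """Yield (cycles_elapsed, values_received_so_far) for an N-to-1 time-division mux.

    Assumes one mux input crosses the chip boundary per master clock cycle.
    """
    received = []
    for slot, value in enumerate(mux_inputs):
        received.append(value)          # the input selected in this time slot arrives
        yield slot + 1, list(received)


# Hypothetical values waiting at the five inputs of a 5-to-1 mux.
mux_inputs = [1, 0, 0, 1, 1]
history = list(tdm_transport(mux_inputs))
cycles_needed = history[-1][0]  # a PDR value of at least this many cycles covers the transfer
```

After three cycles only the first three values have crossed, and all five are present only after the fifth cycle, which is why the text sets the PDR value to 5 for a 5-to-1 mux.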

In this embodiment, a master clock controls the operation of these components. Thus, the
PDC 2702 relies on the master clock input to count down. The propagation detectors 2704-2707
rely on the master clock to determine whether any data in their respective chips are propagating.

How do the propagation detectors alert the global control unit 2700 via the PDC 2702 that
data is still propagating in the FPGA chips? All of the outputs of the propagation detectors are
coupled to each other in a wired-OR configuration. In other words, the outputs of propagation
detectors 2704-2707 are coupled to line 2714, which is coupled to the LD input of the down
counter PDC 2702 in the global control unit 2700. Because the outputs of the propagation
detectors are connected in a wired-OR configuration to line 2714, whenever any of these outputs
is a logic "1," the LD input of PDC 2702 will receive a logic "1" signal to trigger the loading
process. This signal on line 2714 is called the global propagation signal or the propagation detect
(PD) signal. When the LD input is enabled by the logic "1," the PDC 2702 will load the PDR
value in PDR 2701 and the PDC 2702 will count down at every master clock cycle. As
mentioned above, the PDC down counter 2702 is loaded with the value of the PDR at the start of
each evaluation cycle or when the global propagation signal on line 2714 asserts.
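
The reload-and-count-down behavior can be summarized in a short cycle-level model. The Python sketch below is an assumption-laden illustration, not the disclosed circuit: it assumes that in any master clock cycle in which the wired-OR line is 1 the counter reloads instead of decrementing, and that the detectors are quiet once the supplied trace runs out. All names are hypothetical.

```python
def wired_or(detector_outputs):
    """Line 2714: logic 1 if any propagation detector output is 1."""
    return int(any(detector_outputs))


def cycles_until_next_input(pd_per_cycle, pdr_value):
    """Count master clock cycles from the start of evaluation until the PDC reaches 0.

    pd_per_cycle[i] holds the detector outputs during cycle i; cycles beyond the
    end of the list are assumed to have no propagation activity.
    """
    pdc = pdr_value  # loaded at the start of each evaluation cycle
    cycles = 0
    while pdc > 0:
        outputs = pd_per_cycle[cycles] if cycles < len(pd_per_cycle) else ()
        if wired_or(outputs):
            pdc = pdr_value  # LD asserted: reload from the PDR
        else:
            pdc -= 1         # otherwise count down by one
        cycles += 1
    return cycles            # the Next Input signal would assert at this point


quiet = cycles_until_next_input([], 5)
busy = cycles_until_next_input(
    [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 0, 0), (0, 0, 0, 1)], 5
)
```

With a PDR value of 5, an input that causes no inter-chip activity finishes in 5 cycles, while the second trace, whose last activity is in the fourth cycle, finishes 5 cycles after that activity, for 9 cycles in total.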

In this manner, the longest trace length or the worst possible circuit path need not be used 
to statically determine a fixed worst possible evaluation time. So long as the propagation detector 
in each FPGA detects inter-chip propagation of data, the dynamic logic evaluation system will not 
process the next input. Accordingly, 99% of the input need not be unnecessarily delayed for the
sake of the 1% of the input that need the worst possible evaluation time. In one embodiment,
since a time division mux technique is used, the evaluation time in the PDR is proportional to the
number of cycles needed to transport data across neighboring chips. To determine stability of the
output given a particular input, the only data that are monitored are the ones that are involved in
inter-chip propagation.

A more detailed view of the propagation detector will now be provided. The propagation
detector generally receives signals that need inter-chip transport to generate a propagation detect
(PD) signal. The signals that need to be transported to neighboring or otherwise connected chips
are divided into fixed-size groups of signals. With respect to a particular chip, these signals are
considered to be essentially output signals since these signals are being output from that chip to
another chip. FIG. 91 shows an exemplary implementation of a particular propagation detector in
a chip. In FIG. 91, the output signals in this chip are divided into three groups, where each
group includes a group propagation detecting (GPD) logic that receives eight (8) signals. One
GPD logic includes XOR 2720, XOR 2726, and D register 2723. This GPD logic receives eight
signals at XOR 2720; another group receives eight signals at XOR 2721; and a third group
receives eight signals at XOR 2722.

Each GPD logic provides a signal at its respective output, called the "GPD signal," in
response to the inputs to the GPD logic. The output of each GPD logic will become logic "0"
immediately after the master clock. Within a clock cycle, however, the GPD signal will remain
logic "0" if no input signal to the GPD logic changes value. The GPD signal will become logic
"1" if one of the inputs to the GPD logic changes value. The GPD signal will toggle between
logic "1" and logic "0" if more than one of the inputs to the GPD logic change values.

When the inputs to the XOR gate 2720, for example, do not change, the GPD signal is at
logic "0" since the two inputs to the XOR gate 2726 are logic "0." When one of the inputs to
the XOR gate 2720 changes, the XOR gate 2726 generates a logic "1" (since one of the inputs to
the XOR gate 2726 is logic "1" and the other input is logic "0"). At the leading edge of the
master clock, however, the D register 2723 provides logic "1" to one of the inputs to XOR gate
2726 so that the output of XOR gate 2726 is logic "0." Thus, a GPD signal at logic "1"
indicates that an input signal to XOR gate 2720 has changed.

The GPD signals from the GPD logic are provided to OR gate 2729. The OR gate
generates a combined propagation detection signal, called the "CPD signal." When any of the
GPD signals is a logic "1," which indicates a changing signal at the inputs to this propagation
detector, the output of OR gate 2729 is a logic "1." Thus, a CPD signal of logic "1" indicates a
changing signal at the input to the propagation detector.
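
One way to read the GPD and CPD description is as a parity comparison. The Python sketch below is a behavioral interpretation, not the disclosed netlist: it assumes XOR 2720 computes the parity of its eight inputs, D register 2723 captures that parity at each master clock, and XOR 2726 compares the live parity with the captured one. Class and function names are hypothetical.

```python
from functools import reduce
from operator import xor


class GroupPropagationDetector:
    """Behavioral sketch of one GPD logic (XOR 2720, D register 2723, XOR 2726)."""

    def __init__(self, signals):
        self.registered = self.parity(signals)  # D register 2723

    @staticmethod
    def parity(signals):
        return reduce(xor, signals, 0)          # multi-input XOR 2720

    def gpd(self, signals):
        # XOR 2726: 1 when the live parity differs from the registered parity.
        return self.parity(signals) ^ self.registered

    def master_clock(self, signals):
        # At the master clock the register catches up, so GPD returns to 0.
        self.registered = self.parity(signals)


def cpd(detectors, groups):
    """OR gate 2729: combined propagation detection over all groups."""
    return int(any(det.gpd(sig) for det, sig in zip(detectors, groups)))


def gpd_trace():
    signals = [0] * 8
    det = GroupPropagationDetector(signals)
    trace = [det.gpd(signals)]   # no change yet
    signals[3] = 1
    trace.append(det.gpd(signals))  # one input changed
    signals[5] = 1
    trace.append(det.gpd(signals))  # a second input changed: signal toggles back
    signals[5] = 0
    det.master_clock(signals)
    trace.append(det.gpd(signals))  # cleared after the master clock
    return trace
```

Under this reading, a single changed input raises the GPD signal and a second change toggles it back, which matches the toggling behavior described above for more than one changing input.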

The final stage includes a CPD edge detection logic and a CPD level detection logic. The
CPD signal from the OR gate 2729 is provided to both the CPD edge detection logic and the
CPD level detection logic. The CPD edge detection logic includes two D registers 2730 and
2731 in a feedback configuration. The CPD level detection logic includes a D register 2732.

The CPD edge detection logic detects changes in the edge of the CPD signal. Normally,
the output of this CPD edge detection logic is a logic "0." The first D register 2730 receives as
its input a logic "1" (via -Vcc). If a logic "1" is generated at the output of OR gate 2729 (CPD
signal), this logic "1" is used as the clock signal to D register 2730. This causes the logic "1"
to be provided to D register 2731 at a master clock cycle. At this master clock, the D register
2731 outputs a logic "1" which is provided to OR gate 2733 as well as to the reset input of D
register 2730 in a feedback configuration. At the next master clock, D register 2730 is reset and
the output of D register 2731 eventually returns to logic "0."

The CPD level detection logic includes a single D register 2732 to detect the change in the
level of the CPD signal. So long as the input to the D register 2732 is at logic "1" at the
assertion of the master clock, the output of the D register 2732 is at logic "1." This output is
provided to OR gate 2733.

The outputs from the CPD edge detection logic and the CPD level detection logic are
provided to OR gate 2733 to generate the propagation detect (PD) signal. When any of the inputs
to the OR gate 2733 is logic "1," the PD signal will be logic "1." This PD signal is, of course,
provided to the wired-OR line 2714 as the global propagation signal in FIG. 90. Thus,
whenever the PD signal is logic "1," the dynamic evaluation logic system will prevent the next
input in the FPGA chip (e.g., next test bench input) from being processed. When no signal at the
input to the propagation detection logic changes, the PD signal will be logic "0."

In sum, the dynamic evaluation logic includes a global control unit and a plurality of
propagation detectors in the FPGA chips. One propagation detector is provided in each FPGA
chip to detect signals that want to propagate from one chip to another. If these propagating
signals are detected, the applicable propagation detector alerts the global control unit by sending a
propagation detect (PD) or global propagation signal. The global control unit loads a delay value
from a propagation delay register (PDR) into a propagation delay counter (PDC). At each master
clock, the PDC counts down. When the PDC finally counts down to 0, the dynamic evaluation
logic sends a Next Input signal so that the next set of inputs can be processed. However, until
the Next Input signal is asserted, the dynamic evaluation logic continues to evaluate the current
set of inputs until the outputs have stabilized.



H. EMULATION SYSTEM WITH MULTIPLE ASYNCHRONOUS CLOCKS

Current logic emulators use external clock sources to drive logic emulators. One
drawback with the use of such external clock sources is that an external clock source has no
knowledge of the emulator and cannot adapt itself based on the internal state of the logic
emulator. As a result, both the logic emulator system and the external hardware test bench have
to run the clock at the speed of the worst possible evaluation time of the logic emulator. This is
known as the "slow down" process in logic emulation. This problem was discussed above with
respect to the dynamic evaluation logic system.

In accordance with one embodiment of the present invention, the logic emulation system
which uses the dynamic evaluation technology described herein adjusts itself to the shortest
evaluation time based on the input stimulus. This emulation system does not use an external
clock source as its input clock because the external clock source cannot adjust itself based on the
emulation state (i.e., input stimulus). Instead, this emulation system generates clocks in the logic
emulator to control both the logic emulator execution and the external test bench.

Referring to FIG. 92, the emulation system includes the emulator 2870, the clock
generator clkgen 2871, and the hardware model of user's circuit design configured in the
reconfigurable logic elements (shown here collectively as 2876). The emulator is discussed in
greater detail elsewhere in this patent specification. The clock generator 2871 generates clock
signals in hardware and provides them to various points in the emulated model via lines
2873-2875. This clock generator 2871 will be discussed further below.

The emulation system may also include a test bench board 2872 which generates test
bench data in hardware. Typically this test bench board would be a target system (e.g., user's
microprocessor design within the motherboard target system). The test bench board 2872
provides its output on representative lines 2881 and 2882, receives its input from the emulator on
representative lines 2883 and 2884, and receives its clock from representative clock lines 2885
and 2886. These lines are merely representative. More or fewer lines may be used than are shown
in the figure.

As shown in FIG. 92, the emulator generates the clock signals with the clock generator
2871. These clocks are provided to the test bench board 2872 via lines 2885 and 2886. Thus,
the test bench board 2872 does not use its own generated clock or a static external clock
generator; rather, the test bench board uses the emulator's clock. As described herein, the clock
generation logic generates the multiple asynchronous clocks while strictly controlling their
relative phase relationships. Accordingly, the logic evaluation in the emulator can increase in
speed.

The emulator 2870 generates multiple asynchronous clocks via clock generator 2871,
where each generated clock's relative phase relationship with respect to all other generated
clocks is strictly controlled to speed up the emulation logic evaluation. Unlike statically designed
emulator systems known in the prior art, the speed of the logic evaluation in the emulator need
not be slowed down to the worst possible evaluation time since the clocking is generated
internally in the emulator and carefully controlled. The emulation system does not concern itself
with the absolute time duration of each clock, because only the phase relationship among the
multiple asynchronous clocks is important. By retaining the phase relationship (and the initial
values) among the multiple asynchronous clocks, the speed of the logic evaluation in the emulator
can be increased.

By coupling the selected emulator-generated clocks to the emulated design 2876, the logic
evaluation is driven by these emulator-generated and -controlled clocks. Similarly, by coupling
selected emulator-generated clocks to the test bench board 2872, the evaluation of data in the test
bench board components is also driven by these emulator-generated clocks.

An RCC computer system which controls the emulation system, generates the software 
clock, provides software test bench data, and contains a software model of the user's design can 
also be coupled to the emulation system. However, this RCC computer system is not shown in 
FIG. 92. Other sections and figures in this patent specification describe and illustrate the RCC
computer system, the target system, and the hardware accelerator (emulator) in greater detail. 

Clock Specification 

For the single clock dynamic evaluation logic, refer to the previous section. Described 
therein is the emulation system's ability to dynamically adjust its clocking based on the input
stimulus. By doing so, the clock need not be statically slowed down to the worst possible 
evaluation time. Instead, the clock adjusts itself based on the nature of the input stimulus. 

In this section, the emulation system generates multiple asynchronous clocks whose phase 
relationship is strictly controlled to speed up the emulation logic evaluation. Once again, the 
speed of the logic evaluation in the emulator need not be slowed down to the worst possible
evaluation time since the clocking is generated internally in the emulator and carefully controlled.
The emulation system does not concern itself with the absolute time duration of each clock,
because only the phase relationship among the multiple asynchronous clocks is important. By
retaining the phase relationship (and the initial values) among the multiple asynchronous clocks,
the speed of the logic evaluation in the emulator can be increased.

One embodiment of the present invention is an emulation system that generates any 
predetermined or arbitrary number of asynchronous clocks. Each clock has the general 
waveform specification as follows: 

Clkgen(clksig, v0, t1, t2, tc);

where,

"clksig" is the clock signal;
"v0" is the forced current clock value (e.g., 1 or 0);
"t1" represents the time duration from the current time to the first clksig toggle point;
"t2" represents the time duration from the current time to the second clksig toggle point;
"tc" represents the clock period.
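As an illustration only, the waveform specification above can be modeled in a few lines of Python. The class name ClkGen, the method toggle_times, and the sample durations are assumptions made for this sketch and do not appear in the specification. The sketch also assumes t1 < t2 and (t2 - t1) < tc.

```python
from dataclasses import dataclass
from itertools import islice


@dataclass(frozen=True)
class ClkGen:
    """Hypothetical software model of one Clkgen(clksig, v0, t1, t2, tc) entry."""
    clksig: str  # clock signal name
    v0: int      # forced current clock value (0 or 1)
    t1: int      # duration from the current time to the first toggle point
    t2: int      # duration from the current time to the second toggle point
    tc: int      # clock period

    def toggle_times(self):
        """Yield (time, new_value) pairs measured from the current time."""
        value = self.v0
        period_start = 0
        while True:
            for offset in (self.t1, self.t2):
                value ^= 1  # each toggle point inverts the clock value
                yield period_start + offset, value
            period_start += self.tc


# Sample durations are arbitrary; only their relative order matters here.
clk1 = ClkGen("CLK1", v0=0, t1=2, t2=5, tc=10)
first_four = list(islice(clk1.toggle_times(), 4))
print(first_four)
```

The two toggle points repeat every tc time units, which is why three durations plus the starting value are enough to describe the whole waveform.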

Referring now to FIG. 93, three asynchronous clocks are shown. These clocks are merely 
exemplary for the purposes of teaching the invention. More (or fewer) than three clocks may be
used in an actual implementation and the clock waveforms can be of any design. Conforming to
the clkgen specification convention above, the first two clocks in FIG. 93 are defined as follows:

Clkgen(CLK1, 0, t1, t2, tc)
Clkgen(CLK2, 1, t3, t4, td)

For the purpose of this discussion, the third clock is ignored. All three clocks will be 
discussed together in the discussion below on the operation of the clock generation scheduler.
However, in the actual emulation system in accordance with one embodiment of the present
invention, all the asynchronous clocks are strictly controlled to behave in a certain way.

Focusing on the first two clocks of FIG. 93, assume that the current time is time 2800.
Per the clock definition, CLK1 starts off at logic "0" at time 2800 and toggles to logic "1" at
time 2801. The time duration from time 2800 (the current time) to time 2801 is t1. CLK1 then
toggles to logic "0" at time 2802. The time duration from time 2800 to time 2802 is t2. The
period of this clock is tc, represented here as the time duration from time 2801 to time 2805 (or
the time duration from time 2802 to time 2806).

Similarly, per the clock definition, CLK2 starts off at logic "1" at time 2800 and toggles
to logic "0" at time 2802. The time duration from time 2800 (the current time) to time 2802 is
t3. CLK2 then toggles to logic "1" at time 2803. The time duration from time 2800 to time
2803 is t4. The period of this clock is td, represented here as the time duration from time 2803
to time 2805 (or the time duration from time 2805 to time 2808).

The clock definition is a simulation domain concept. Realization of the clock definition in
the emulator system itself is different from the specification.

For these asynchronous clocks (and all other asynchronous clocks generated by the 
emulator system), the phase relationships between the clocks are important. The phase 
relationship within a single clock is not relevant. What this implies is that the absolute time 
durations of t1, t2, t3, t4, tc, and td are not important; what is important are the phase
relationships between these two clocks.

Two properties make the dynamic clock generation possible: (1) starting values of the 
clocks; and (2) phase relationship between/among the clocks. So, for the two clocks of FIG. 93, 
CLK1 must start at logic "0" and CLK2 must start at logic "1" per the clock definition.
Thereafter, the sequence of events is as follows:

CLK1 toggles to logic "1"
CLK1 toggles to logic "0"
CLK2 toggles to logic "0"
CLK2 toggles to logic "1"
CLK2 toggles to logic "0"
CLK1 toggles to logic "1"
...
CLK2 toggles to logic "1"

... and so forth as shown in FIG. 93.

As discussed above, these two properties (i.e., the initial value of the clocks and the phase
relationship between the clocks) make the dynamic clock generation possible. The absolute time 
duration and phase relationship of each clock in isolation are not relevant. 
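The point that only the starting values and the relative ordering of toggle points matter can be checked with a small sketch. The function event_sequence and all numeric durations below are hypothetical; they were chosen only so that the resulting order matches the sequence of events listed above for CLK1 and CLK2. Simultaneous toggles are assumed to be ordered by clock index.

```python
def event_sequence(clocks, n_events):
    """Return the first n_events toggles as (name, new_value), in time order.

    clocks: list of (name, v0, first, second, period) tuples, where `first`
    and `second` are durations from the current time to the two toggle points.
    """
    events = []
    for index, (name, v0, first, second, period) in enumerate(clocks):
        value = v0
        for k in range(n_events):  # n_events toggles per clock is always enough
            start = (k // 2) * period
            offset = first if k % 2 == 0 else second
            value ^= 1
            events.append((start + offset, index, name, value))
    events.sort()  # by time, then by clock index for simultaneous toggles
    return [(name, value) for _, _, name, value in events[:n_events]]


# Hypothetical durations; only their relative ordering is meaningful.
base = [("CLK1", 0, 1, 2, 6), ("CLK2", 1, 2, 3, 4)]
# Stretch every duration by the same factor: absolute times change,
# but the order of toggle events does not.
scaled = [(n, v, 10 * a, 10 * b, 10 * p) for n, v, a, b, p in base]

sequence = event_sequence(base, 7)
print(sequence)
print(sequence == event_sequence(scaled, 7))
```

Because the event order is unchanged under uniform scaling, an emulator that reproduces this order (and the starting values) reproduces the behavior of the clocks without honoring their absolute durations.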

Clock Generation Scheduler

If only one clock generator is used in the entire design, then only a loadable T flip-flop is 
needed to realize the clock generator in the RCC system. The T flip-flop must be loadable so 
that when swapping occurs, the current clock value can be programmed. When the RCC 
system's EvalStart signal is provided, the emulator reads the next set of input data and evaluates
the data. The EvalStart signal represents the start of this cycle. In one embodiment, the RCC
system would control the toggling of the T flip-flop with the EvalStart signal.
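For the single-clock case, a minimal behavioral sketch of a loadable T flip-flop is shown below. The class and method names are assumptions for illustration; the sketch models only the two behaviors described above, namely loading a current clock value (for example after swapping) and toggling on each EvalStart.

```python
class LoadableTFlipFlop:
    """Hypothetical behavioral model of a loadable T flip-flop."""

    def __init__(self, value=0):
        self.value = value & 1

    def load(self, value):
        """Program the current clock value, e.g. when swapping occurs."""
        self.value = value & 1

    def eval_start(self):
        """Toggle once per evaluation cycle and return the new clock value."""
        self.value ^= 1
        return self.value


ff = LoadableTFlipFlop()
ff.load(1)  # restore a clock that was at logic "1"
values = [ff.eval_start() for _ in range(4)]
print(values)
```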

If more than one clock is generated, clock generation logic is implemented in the RCC
System. The RCC clock generation logic comprises a clock generation scheduler and a set of
clock generation slices. The clock generation scheduler schedules the execution of the clock
generation slices. Each clock generation slice represents one clock in the clkgen specification.

FIG. 94 shows a clock generation scheduler in accordance with one embodiment of the
present invention. The clock generation scheduler includes a subtractor 2820, a Min register
2821, a finite state machine 2822, and a multiplexer 2823 which interact with a set of clock
generation slices 2824-2826. Each clock generation slice such as clock generation slice 2825
includes a Z register (e.g., Z register 2852) and an R0 register (e.g., R0 register 2853). The clock
generation slice also contains other components, which will be discussed further below. In
FIG. 94, only three clock generation slices are shown because only three asynchronous clocks are
generated in this example.

The clock generation scheduler performs the following algorithm:

(1) find the minimum value from the R0 registers of all the clock generation slices; and
(2) subtract the minimum value from the R0 registers of all the clock generation slices and
set the Z register to logic "1" if the result of the subtraction is "0."
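The two steps above amount to a minimum search followed by an element-wise subtraction. The following sketch expresses them on a plain list of R0 values; the function name schedule_step and the sample values are assumptions for illustration, and the sketch ignores how the hardware sequences these steps.

```python
def schedule_step(r0_values):
    """Apply the two-step algorithm to a list of R0 values.

    Returns (minimum, adjusted_r0_values, z_flags).
    """
    # Step (1): find the minimum R0 value across all slices.
    minimum = min(r0_values)
    # Step (2): subtract it from every R0 and flag the slices that reach zero.
    adjusted = [r0 - minimum for r0 in r0_values]
    z_flags = [1 if r0 == 0 else 0 for r0 in adjusted]
    return minimum, adjusted, z_flags


result = schedule_step([1, 2, 4])
print(result)
```

A slice whose Z flag is "1" after this step is the one whose next toggle point is closest to the current time.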

The structure of the clock generation scheduler is as follows. In this example, three clock 
generation slices 2824-2826 are shown. The clock generation slices are coupled together through 
their respective Z and R0 registers.

Clock generation slice 2824 generates CLK1. It is coupled to clock generation slice 2825
via line 2839 (which couples the Z registers in both slices together) and line 2842 (which couples
the R0 registers in both slices together). The R0 register of slice 2824 is coupled via line 2831a
to the Min register 2821 via line 2831c, the subtractor 2820 via line 2831b, and the mux 2823 via
line 2831d. The slice 2824 also receives control signals from finite state machine 2822 via line
2836 (Next signal) and the RCC System via line 2835 (EvalStart signal).

Clock generation slice 2825 generates CLK2. It is coupled to clock generation slice 2824
via line 2839 (which couples the Z registers in both slices together) and line 2842 (which couples
the R0 registers in both slices together). In addition, slice 2825 is coupled to slice 2826 via line
2838 (which couples the Z registers in both slices together) and line 2841 (which couples the R0
registers in both slices together). The slice 2825 also receives control signals from finite state
machine 2822 via line 2836 (Next signal) and the RCC System via line 2835 (EvalStart signal).

Clock generation slice 2826 generates CLK3. Slice 2826 is coupled to slice 2825 via line
2838 (which couples the Z registers in both slices together) and line 2841 (which couples the R0
registers in both slices together). Slice 2826 also receives the output of mux 2823 in its R0
register via line 2840, and a control signal from the subtractor 2820 into its Z register via line
2837. Slice 2826 also receives control signals from finite state machine 2822 via line 2836 (Next
signal) and the RCC System via line 2835 (EvalStart signal).

The subtractor 2820 receives as its inputs the value of the R0 register in slice 2824 via
line 2831b and the current minimum value in the Min register 2821 via line 2832. Incidentally,
the value of the R0 register in slice 2824 is also provided to mux 2823 via line 2831d as one of
the inputs to the mux. These two input values in the subtractor 2820 are subtracted and the result
("SUB RESULT") is provided on line 2830 as one of the inputs to mux 2823.

As described further below, the subtractor compares the R0 values in all the slices and
performs the subtraction. If the result of the subtraction is "0," the subtractor provides a logic
"1" to the Z register in slice 2826 via line 2837; otherwise the subtractor provides a logic "0"
on line 2837. During the stage when the minimum value among the R0 registers is being
determined, the mux outputs the R0 value, not the SUB RESULT in subtractor 2820.

The Min register 2821 holds the minimum R0 value and provides this minimum value to
the subtractor 2820 via line 2832. At the start of each EvalStart cycle, as indicated by the
EvalStart signal on line 2835, the Min register 2821 is loaded with the maximum possible value
based on the number of digits in the register. This is done by setting all the digits to logic "1."
Thereafter, the next R0 that is received by the Min register 2821 via line 2831c will be the new
minimum value. A new R0 value is provided from the R0 register in slice 2824 to the Min
register via line 2831c. If this new R0 value is less than the current minimum, this new R0 value
displaces the current minimum value as the new minimum value. A load signal on line 2834
from the finite state machine 2822 loads this R0 value as the new minimum value.
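The Min register behavior described above (preset to all ones, then replaced whenever a smaller R0 arrives) can be sketched as follows. The class name, the method names, and the 8-bit width are assumptions for illustration. The boolean returned by offer plays the role of the "less than" indication that leads to the load signal.

```python
class MinRegister:
    """Hypothetical model of the Min register with an all-ones preset."""

    def __init__(self, width):
        self.width = width
        self.value = 0
        self.reset()

    def reset(self):
        """At EvalStart: set every bit to logic "1" (maximum possible value)."""
        self.value = (1 << self.width) - 1

    def offer(self, r0):
        """Present one R0 value; load it if it is less than the current minimum."""
        is_less = r0 < self.value
        if is_less:
            self.value = r0
        return is_less


reg = MinRegister(width=8)
preset = reg.value
loaded = [reg.offer(r0) for r0 in (9, 4, 6)]
print(preset, loaded, reg.value)
```

Presetting to the maximum value guarantees that the first R0 presented in a cycle is accepted without a special case.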

The mux 2823 receives as its inputs the current R0 value from the R0 register in slice
2824 via line 2831d and the current subtraction result from the subtractor 2820 via line 2830.
The output of the mux 2823 is provided on line 2840 to the R0 register in slice 2826. A control
signal is provided by the finite state machine 2822 via line 2845.

As discussed further below, the clock scheduler performs its operations through two
stages - (1) determine the minimum value among the R0 register values, and (2) subtract this
minimum value from the R0 register values. The control signal selects the R0 register value on
line 2831d during the minimum R0 value seek stage. However, during the subtraction stage, the
control signal selects the subtraction result from the subtractor 2820 on line 2830. Whatever
value is output from the mux 2823 writes over the R0 register of slice 2826.

The finite state machine 2822 schedules the execution of the above two-step algorithm by
providing control signals to the various components of this clock generation scheduler. If the
current R0 value in the R0 register of slice 2824 is less than the current minimum value in the
Min register 2821, then a logic "1" signal is provided to the finite state machine 2822 via line
2833. In addition, the load signal on line 2834 loads the current R0 value as the new minimum
value in the Min register 2821 if this new R0 value is less than the minimum value in the Min
register 2821. The finite state machine 2822 is also made aware of the EvalStart signal on line
2835 and also provides the Next signal on line 2836. The Next signal is analogous to a next
instruction command. For the clock scheduler, the EvalStart signal is used to rotate register
values among the R0, R1, and R2 registers within a winning clock generation slice. However,
the Next signal is used to globally rotate register values across multiple clock generation slices.



Clock Generation Slice 

In FIG. 94, three exemplary clock generation slices are shown. To examine the clock 
generation slices in more detail, refer now to FIG. 95. Here clock generation slice 2825, which
generates CLK2, is illustrated in greater detail. Clock generation slice 2825 contains five
loadable registers - a T flip-flop 2851, a Z register 2852, an R0 register 2853, an R1 register
2854, and an R2 register 2855. A control logic 2850 is provided to control the operation of these
five registers.

The T flip-flop 2851 holds the clock value (i.e., logic "1" or "0") on line 2860 and thus
represents CLK2 for this slice 2825. This T flip-flop register is initialized to "v0" per the clkgen
clock definition and toggles when both the Z register 2852 and the EvalStart signal on line 2835
are at logic "1." The T flip-flop 2851 also receives a control signal from the control logic 2850
via line 2861 to control when the T flip-flop 2851 should toggle.

The R0 register 2853 keeps the time duration from the current time to the next trigger
point. The RCC software will initialize the R0 register 2853 to t1 per the clkgen clock definition.
The R0 register 2853 in this slice 2825 links to other clock generation slices in a rotation ring
for the clock scheduling. The previous R0 from a neighboring slice is provided on line 2841,
while the current R0 value in the R0 register 2853 of this slice 2825 is provided on line 2842 to
the next R0 register in the next neighboring slice. The R1 register 2854 outputs its value to the
R0 register 2853 via line 2865 at the assertion of the Next signal from the clock generation
scheduler. The Next signal from the scheduler will rotate R1 with its neighboring slices.

The R1 register 2854 keeps the time duration from the first toggle point to the second
toggle point. The RCC system software will initialize R1 to (t2-t1). The R1 register 2854
receives some value from the R2 register 2855 via line 2863, provides its value to the R2 register
2855 via line 2864, and provides its value to the R0 register 2853 via line 2865 at the assertion of
the EvalStart signal. The control logic 2850 receives this EvalStart signal and translates it to a
control signal on line 2867 to the R1 and R2 registers to rotate their respective values
accordingly.

The R2 register 2855 keeps the time duration from the second toggle point to the next first
toggle point. The RCC system software will initialize R2 to (tc-t2+t1). The R2 register 2855
receives some value from the R1 register 2854 via line 2864, and provides its value to the R1
register 2854 via line 2863 at the assertion of the EvalStart signal. The control logic 2850
receives this EvalStart signal (and Z register value) and translates it to a control signal on line
2867 to the R1 and R2 registers to rotate their respective values accordingly.

With respect to the relationship of the R0, R1, and R2 registers, R1 transfers its value to
R0, while R1 and R2 rotate when both the Z register 2852 and the EvalStart signal on line 2835
are at logic "1." The rotation occurs whenever the clock slice associated with these registers
wins the comparison of the lowest R0 value (i.e., closest next toggle point from the current time).
All other R0, R1, and R2 registers in the losing clock slices do not rotate. However, the values
in the R0 registers for these losing clock slices are adjusted for the current time.
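The register initialization (R0 = t1, R1 = t2-t1, R2 = tc-t2+t1) and the rotation performed by a winning slice can be sketched together. The class SliceRegisters, its method names, and the sample durations are assumptions for illustration. The rotation shown (R1 moves into R0 while R1 and R2 exchange values) is one reading of the description above.

```python
from dataclasses import dataclass


@dataclass
class SliceRegisters:
    """Hypothetical model of the R0, R1, and R2 registers of one slice."""
    r0: int  # duration from the current time to the next toggle point
    r1: int
    r2: int

    @classmethod
    def from_clkgen(cls, t1, t2, tc):
        # Initial values as described for the RCC system software.
        return cls(r0=t1, r1=t2 - t1, r2=tc - t2 + t1)

    def rotate(self):
        """Winning slice: R1 moves into R0 while R1 and R2 swap."""
        self.r0, self.r1, self.r2 = self.r1, self.r2, self.r1


regs = SliceRegisters.from_clkgen(t1=1, t2=2, tc=6)
initial = (regs.r0, regs.r1, regs.r2)
regs.rotate()  # after the first toggle point
after_first = (regs.r0, regs.r1, regs.r2)
regs.rotate()  # after the second toggle point
after_second = (regs.r0, regs.r1, regs.r2)
print(initial, after_first, after_second)
```

Note that R1 and R2 always sum to the clock period tc, so alternating between them reproduces the two toggle points of every subsequent period.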

The Z register 2852 partially controls the toggling of the clock value and the rotation of
the R0, R1, and R2 register values. If the value of the R0 register becomes logic "0," then the
value of the Z register becomes logic "1." The Z register 2852 is linked to its neighboring slices
in a shift pipe for clock scheduling via lines 2838 and 2839. The Next signal from the clock
generation scheduler will rotate the value in the Z register 2852 with its neighboring slices. The
control logic 2850 receives this Next signal and translates it to a control signal on line 2862 to the
Z register to shift its value down the pipe. Also, the value of the Z register is provided to the
control logic 2850 on line 2866 so that the control logic can determine whether to toggle the T
flip-flop 2851 for the clock signal. If both the Z register value and the EvalStart signal are at
logic "1," then the control logic 2850 will toggle the T flip-flop 2851.
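The toggle condition just described reduces to a small Boolean expression: the clock value flips only when the Z register value and the EvalStart signal are both at logic "1". A minimal sketch, with the function name next_clock_value chosen for illustration:

```python
def next_clock_value(clk, z, eval_start):
    """Return the T flip-flop value after applying the toggle condition."""
    toggle = z & eval_start  # toggle only when both are logic "1"
    return clk ^ toggle


# Truth table for a clock currently at logic "0".
table = {(z, e): next_clock_value(0, z, e) for z in (0, 1) for e in (0, 1)}
print(table)
```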

The control logic 2850 controls the operation of the five registers in this slice 2825. Also,
the value of the Z register 2852 is provided to the control logic 2850 on line 2866 so that the
control logic can determine whether to toggle the T flip-flop 2851 for the clock signal. If both
the Z register value and the EvalStart signal are at logic "1," then the control logic 2850 will
toggle the T flip-flop 2851. The control logic 2850 delivers a control signal via line 2861 to
control when the T flip-flop 2851 should toggle. The control logic 2850 receives an EvalStart
signal on line 2835 and translates it to a control signal on line 2867 to the R1 and R2 registers to
rotate their respective values accordingly. The control logic 2850 also receives the Next
signal on line 2836 and translates it to a control signal on line 2862 to the Z register to shift its
value down the pipe with its neighboring slices.

Operation of the Clock Generation Logic

The operation of the clock generation logic will now be described with respect to FIGS.
96 and 93. FIG. 96 shows not only the clock generation scheduler but also the internal
components of the clock generation slices. FIG. 93 shows three clocks.

At a high level, the clock generation scheduler performs the following algorithm for each 
evaluation cycle, as indicated by EvalStart signal: 



(1) set initial values for all registers; 

(2) from the current time, find the next toggle point for all the clocks; 

(3) toggle the clock associated with this next toggle point; 

(4) adjust the current time to be the time associated with this toggle point; 

(5) adjust the next toggle point for the winning clock slice, while keeping all other clock
slices' respective next toggle points (the toggle points will be the same for the losing slices but 
the time durations will be adjusted based on the new current time). 

Stated differently and using clock scheduler component terminology, the clock generation 
scheduler performs the following two-step algorithm: 

(1) find the minimum value from the R0 registers of all the clock generation slices; and

(2) subtract the minimum value from the R0 registers of all the clock generation slices and
set the Z register to logic "1" if the result of the subtraction is "0."
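To make the per-cycle algorithm concrete, the following behavioral sketch runs it for three clocks. It is a software model under stated assumptions, not the hardware realization: the function simulate, the dictionary field names, and every numeric duration are hypothetical, and when several slices reach zero in the same cycle the sketch assumes the first slice in order is treated as the winner.

```python
def simulate(specs, n_events):
    """Run the find-minimum / subtract / rotate loop for n_events toggles.

    specs: list of (name, v0, t1, t2, tc). Returns (time, name, new_value) tuples.
    """
    slices = [
        {"name": name, "clk": v0, "r0": t1, "r1": t2 - t1, "r2": tc - t2 + t1}
        for name, v0, t1, t2, tc in specs
    ]
    now = 0
    trace = []
    for _ in range(n_events):
        # Step (1): the smallest R0 is the duration to the next toggle point.
        minimum = min(s["r0"] for s in slices)
        # Step (2): subtracting it moves the current time to that toggle point.
        now += minimum
        for s in slices:
            s["r0"] -= minimum
        # The winner is a slice whose R0 reached zero (Z register at logic "1").
        winner = next(s for s in slices if s["r0"] == 0)
        winner["clk"] ^= 1
        # Only the winner rotates its registers; losing slices keep R1 and R2.
        winner["r0"], winner["r1"], winner["r2"] = (
            winner["r1"], winner["r2"], winner["r1"])
        trace.append((now, winner["name"], winner["clk"]))
    return trace


specs = [("CLK1", 0, 1, 2, 6), ("CLK2", 1, 2, 3, 4), ("CLK3", 0, 5, 8, 9)]
trace = simulate(specs, 5)
for entry in trace:
    print(entry)
```

With these sample values two clocks toggle at the same time (time 2 in the trace), so one cycle has a minimum of zero and the current time does not advance between the two events.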

When the EvalStart signal is provided, each clock generation slice will update its clock
value and the finite state machine starts execution of the above two-step algorithm to determine
the next clock toggle event while the RCC system performs logic evaluation with the current set
of input stimulus. The finite state machine rotates the R0 ring twice - the first time to find the
minimum value of all the R0s, and the second time to subtract the minimum value from the
current R0s. An inner rotation of the R0, R1, and R2 registers within each clock generation slice
updates the register values so that the winning clock generation slice contains the proper next
toggle point information for future toggle point comparisons among all the clock slices. In
essence, for each next toggle point comparison, the winning clock generation slice rotates the R0,
R1, and R2 registers, while the losing clock generation slices update their respective R0 register
values based on the current time.

These inner rotation operations are triggered by the EvalStart signal. After receiving the 
EvalStart signal, this algorithm completes its task in 2*(number of slices) cycles, which is fast
enough for all practical designs. 
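The 2*(number of slices) cycle count follows from rotating the R0 ring once per stage. The sketch below models the ring as a queue: on each Next cycle the head value is presented to the Min register and subtractor, and the mux output is written back at the tail. The function name, the 16-bit register width, and the sample values are assumptions for illustration, and the direction of rotation is a modeling choice.

```python
from collections import deque


def run_scheduler_pass(r0_ring, width=16):
    """Rotate the R0 ring twice; return (r0_values, z_values, minimum, cycles)."""
    n = len(r0_ring)
    r0 = deque(r0_ring)
    z = deque([0] * n)
    min_reg = (1 << width) - 1  # Min register preset to all ones
    cycles = 0

    # Stage 1: seek the minimum. The mux passes each R0 through unchanged.
    for _ in range(n):
        head = r0.popleft()
        if head < min_reg:
            min_reg = head
        r0.append(head)
        cycles += 1

    # Stage 2: subtract. The mux passes the subtraction result instead.
    for _ in range(n):
        sub_result = r0.popleft() - min_reg
        r0.append(sub_result)
        z.popleft()
        z.append(1 if sub_result == 0 else 0)  # shifted along the Z pipe
        cycles += 1

    return list(r0), list(z), min_reg, cycles


outcome = run_scheduler_pass([3, 1, 4])
print(outcome)
```

After both rotations every value is back in its original position, so each slice ends up holding its own adjusted R0 and its own Z flag.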

Each clock generation slice generates a single clock per the clkgen clock specification. If
N asynchronous clocks are needed for the design, N clock generation slices will be provided. In
FIG. 96, three clock slices are shown for the three clocks, CLK1, CLK2, and CLK3. The timing
diagram of these three clocks is shown in FIG. 93.

With respect to FIG. 93, the operation of the clock generation logic will be described for
the initial time 2800 and four exemplary toggle points - times 2801, 2802, 2803, and 2804.

Current time 2800

Initially, the clock generation logic sets the initial values in the various registers. The
clock generation logic compares all the time durations from the current time to the next toggle
point for all three clocks. These time duration values are held in the R0 registers in the clock
slices. Initially, these time durations are the t1 values for each clock, or essentially the time
duration from the current time to the first toggle point. So, register R0 for CLK1 clock slice
2824 holds the time duration from time 2800 to time 2801, register R0 for CLK2 clock slice 2825
holds the time duration from time 2800 to time 2802, and register R0 for CLK3 clock slice 2826
holds the time duration from time 2800 to time 2804.

Based on the comparison, the clock generation logic selects the lowest time duration 
because this time duration represents the next closest toggle point. The clock associated with this 
lowest time duration toggles. In FIG. 93, this next toggle point is represented by CLK1, which
toggles at time 2801. This clock slice represents the winning clock slice since it is associated
with the next toggle point, or the lowest R0 value among all the R0 registers. Note that at this
point, the comparisons have been done with first toggle points for each of the three clocks.

The clock generation logic then subtracts this time duration (time 2800 to time 2801) from
the other time durations in the R0 registers of their respective clock slices. The emulation system
(and the RCC system) now views time 2801 as the current time. After this subtraction, the clock
generation logic is now ready to look for the next toggle point. These comparison and
subtraction steps are accomplished with the Next signal for globally rotating the R0 values across
multiple clock generation slices.

Prior to looking for the next toggle point, the clock generation logic rotates the value of
the R0, R1, and R2 registers of the winning slice, in this case slice 2824, with the assertion of the
EvalStart signal. Register R0 would now contain the time duration from the prior first toggle
point to a second toggle point. Here, this is represented by the time duration from time 2801 to
time 2802. Register R1 would now contain the time duration from this second toggle point to the
next first toggle point (time 2802 to time 2805), while register R2 would hold the time duration
from the first toggle point to the second toggle point (time 2801 to time 2802). Although the
winning slice (slice 2824 in this example) would hold this new time duration in the R0 register,
all the other slices would retain their original time duration to the first toggle point with some
adjustment for the new current time (now time 2801). After all, the valid comparisons should be
the updated next toggle point of the winning slice and the next toggle point of all the losing
slices.

Current time 2801

With the current time at time 2801 (based on the subtraction), the clock generation logic
then compares the time duration to the next toggle point for each of the clocks. Once again, these
time durations are held in the R0 registers in the clock slices. For CLK1, this is the time
duration from time 2801 to time 2802. For CLK2, its register R0 holds the time duration from
time 2801 to time 2802. For CLK3, its register R0 holds the time duration from time 2801 to
time 2804. For CLK2 and CLK3, the values are adjusted from the previous evaluation cycle
based on the new current time (now time 2801).

The clock generation logic compares all the time durations from the current time (now
time 2801) to the next toggle point for all three clocks. These time duration values are held in
the R0 registers in the clock slices as described above. Based on the comparison, the clock
generation logic selects the lowest time duration because this time duration represents the next
closest toggle point. The clock associated with this lowest time duration toggles. In FIG. 93,
this next toggle point is represented by CLK1 again, which toggles at time 2802. This clock slice
represents the winning clock slice since it is associated with the next toggle point, or the lowest
R0 value among all the R0 registers.

The clock generation logic then subtracts this time duration (time 2801 to time 2802) from
the other time durations in the R0 registers of their respective clock slices. The emulation system
(and the RCC system) now views time 2802 as the current time. After this subtraction, the clock
generation logic is now ready to look for the next toggle point.

Prior to looking for the next toggle point, the clock generation logic rotates the value of
the R0, R1, and R2 registers of the winning slice, in this case slice 2824. Register R0 would
now contain the time duration from the prior second toggle point to the next first toggle point.
Here, this is represented by the time duration from time 2802 to time 2805. Register R1 would
now contain the time duration from this next first toggle point to the second toggle point (time
2805 to time 2806), while register R2 would hold the time duration from this second toggle point
to the next first toggle point (time 2806 to time 2811). Although the winning slice (slice 2824 in
this example) would hold this new time duration in the R0 register, all the other slices would
retain their original time duration to their respective first toggle point with some adjustment for
the new current time (now time 2802). After all, the valid comparisons should be the updated
next toggle point of the winning slice and the next toggle point of all the losing slices.

Current time 2802 

With the current time at time 2802 (based on the subtraction), the clock generation logic
then compares the time duration to the next toggle point for each of the clocks. Once again, these
time durations are held in the R0 registers in the clock slices. For CLK1, this is the time
duration from time 2802 to time 2805. For CLK2, its register R0 holds the time duration from
time 2802 to time 2802. For CLK3, its register R0 holds the time duration from time 2802 to
time 2804. For CLK2 and CLK3, the values are adjusted from the previous evaluation cycle
based on the new current time (now time 2802).

The clock generation logic compares all the time durations from the current time (now
time 2802) to the next toggle point for all three clocks. These time duration values are held in
the R0 registers in the clock slices as described above. Based on the comparison, the clock
generation logic selects the lowest time duration because this time duration represents the next
closest toggle point. The clock associated with this lowest time duration toggles. In FIG. 93,
this next toggle point is represented by CLK2, which toggles at time 2802. This clock slice
represents the winning clock slice since it is associated with the next toggle point, or the lowest
R0 value among all the R0 registers.

The clock generation logic then subtracts this time duration (time 2802 to time 2802) from
the other time durations in the R0 registers of their respective clock slices. The emulation system
(and the RCC system) now views time 2802 as the current time, even though this is the same
current time as the last evaluation cycle. This is because two clocks toggled at this same time.
After this subtraction, the clock generation logic is now ready to look for the next toggle point.

Prior to looking for the next toggle point, the clock generation logic rotates the value of
the R0, R1, and R2 registers of the winning slice, in this case slice 2825. Register R0 would
now contain the time duration from the prior first toggle point to the second toggle point. Here,
this is represented by the time duration from time 2802 to time 2803. Register R1 would now
contain the time duration from this second toggle point to the next first toggle point (time 2803 to
time 2810), while register R2 would hold the time duration from the first toggle point to the
second toggle point (time 2810 to time 2805). Although the winning slice (slice 2825 in this
example) would hold this new time duration in the R0 register, all the other slices would retain
their original time duration to their respective next toggle points with some adjustment for the
new current time (now time 2802). After all, the valid comparisons should be the updated next
toggle point of the winning slice and the next toggle point of all the losing slices.

Current time 2802 (again)

With the current time at time 2802 (based on the subtraction), the clock generation logic
then compares the time duration to the next toggle point for each of the clocks. Once again, these
time durations are held in the R0 registers in the clock slices. For CLK1, this is the time
duration from time 2802 to time 2805. For CLK2, its register R0 holds the time duration from
time 2802 to time 2803. For CLK3, its register R0 holds the time duration from time 2802 to
time 2804. For CLK1 and CLK3, the values are adjusted from the previous evaluation cycle
based on the new current time (now time 2802).

The clock generation logic compares all the time durations from the current time (now 
^glO time 2802) to the next toggle point for all three clocks. These time duration values are held in 
^ the RO registers in the clock slices as described above. Based on the comparison, the clock 

generation logic selects the lowest time duration because this time duration represents the next 
closest toggle point. The clock associated with this lowest time duration toggles. In FIG. 93, 
this next toggle point is represented by CLK2 again, which toggles at time 2803. This clock slice 
J 15 represents the winning clock slice since it is associated with the next toggle point, or the lowest 
RO value among all the RO registers. 

The clock generation logic then subtracts this time duration (time 2802 to time 2803) from 
the other time durations in the RO registers of their respective clock slices. The emulation system 
(and the RCC system) now views time 2803 as the current time. After this subtraction, the clock 
20 generation logic is now ready to look for the next toggle point. 

Prior to looking for the next toggle point, the clock generation logic rotates the value of 
the RO, Rl, and R2 registers of the wiiming slice, in this case slice 2825. Register RO would 
now contain the time duration from the second toggle point to the next first toggle point. Here, 
this is represented by the time duration from time 2803 to time 2810. Register Rl would now 
25 contain the time duration from the first toggle point to the second toggle point (time 2810 to time 
2805), while register R2 would hold the time duration from the second toggle point to the next 
first toggle point (time 2805 to time 2812). Although the winning slice (slice 2825 in this 
example) would hold this new time duration in the RO register, all the other slices would retain 
their original time duration to their respective next toggle points with some adjustment for the 
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new current time (now time 2803). After all, the valid comparisons should be the updated next 
toggle point of the winning slice and the next toggle point of all the losing slices. 

Current time 2803 

With the current time at time 2803 (based on the subtraction), the clock generation logic
then compares the time duration to the next toggle point for each of the clocks. Once again, these
time durations are held in the R0 registers in the clock slices. For CLK1, this is the time
duration from time 2803 to time 2805. For CLK2, its register R0 holds the time duration from
time 2803 to time 2810. For CLK3, its register R0 holds the time duration from time 2803 to
time 2804. For CLK1 and CLK3, the values are adjusted from the previous evaluation cycle
based on the new current time (now time 2803).

The clock generation logic compares all the time durations from the current time (now
time 2803) to the next toggle point for all three clocks. These time duration values are held in
the R0 registers in the clock slices as described above. Based on the comparison, the clock
generation logic selects the lowest time duration because this time duration represents the next
closest toggle point. The clock associated with this lowest time duration toggles. In FIG. 93,
this next toggle point is represented by CLK3, which toggles at time 2804. This clock slice 2826
represents the winning clock slice since it is associated with the next toggle point, or the lowest
R0 value among all the R0 registers.

The clock generation logic then subtracts this time duration (time 2803 to time 2804) from
the other time durations in the R0 registers of their respective clock slices. The emulation system
(and the RCC system) now views time 2804 as the current time. After this subtraction, the clock
generation logic is now ready to look for the next toggle point.

Prior to looking for the next toggle point, the clock generation logic rotates the values of
the R0, R1, and R2 registers of the winning slice, in this case slice 2826, in the manner described
above. Register R0 would now contain the value from the R1 register, while registers R1 and R2
swap values. Although the winning slice (slice 2826 in this example) would hold this new time
duration in the R0 register, all the other slices would retain their original time duration to their
respective next toggle points with some adjustment for the new current time (now time 2804).
After all, the valid comparison is between the updated next toggle point of the winning slice and
the next toggle points of all the losing slices.

In sum, the emulator generates multiple asynchronous clocks via a clock generation logic
where each generated clock's relative phase relationship with respect to all other generated clocks
is strictly controlled to speed up the emulation logic evaluation. Unlike statically designed
emulator systems known in the prior art, the speed of the logic evaluation in the emulator need
not be slowed down to the worst possible evaluation time since the clocking is generated
internally in the emulator and carefully controlled. The emulation system does not concern itself
with the absolute time duration of each clock, because only the phase relationship among the
multiple asynchronous clocks is important. By retaining the phase relationship (and the initial
values) among the multiple asynchronous clocks, the speed of the logic evaluation in the emulator
can be increased. This is accomplished with a clock generation logic that comprises a clock
generation scheduler and a set of clock generation slices, where each clock generation slice
generates a clock. The clock generation scheduler compares each clock's next toggle point from
the current time, toggles the clock associated with the winning next toggle point, determines the
new current time, updates the next toggle point information for all of the clock generation slices,
and performs the comparison again in the next evaluation cycle. In the update phase, the winning
slice updates its register with a new next toggle point, while the losing slices merely update their
respective registers by adjusting for the new current time.
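
The evaluation cycle summarized above can be illustrated with a short Python sketch. It is only a software illustration under stated assumptions, not the disclosed hardware: the names `ClockSlice`, `rotate`, and `step` are hypothetical, durations are plain integers, and ties between equal R0 values are broken by list order, which the description does not specify.

```python
from dataclasses import dataclass


@dataclass
class ClockSlice:
    """Hypothetical model of one clock generation slice."""
    name: str
    r0: int          # duration from the current time to this clock's next toggle
    r1: int          # duration that follows r0
    r2: int          # duration that follows r1
    level: int = 0   # current clock level

    def rotate(self):
        # R0 takes the value from R1, while R1 and R2 swap values.
        self.r0, self.r1, self.r2 = self.r1, self.r2, self.r1


def step(slices, now):
    """Run one evaluation cycle and return (winning slice, new current time)."""
    winner = min(slices, key=lambda s: s.r0)   # lowest R0 = next closest toggle
    elapsed = winner.r0
    for s in slices:
        if s is not winner:
            s.r0 -= elapsed                    # losers adjust for the new current time
    winner.level ^= 1                          # the winning clock toggles
    winner.rotate()                            # the winner loads its next toggle point
    return winner, now + elapsed
```

When two clocks toggle at the same time, the second one wins the following cycle with an R0 of zero, so the current time does not advance, which matches the "current time 2802 (again)" case in the walkthrough.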

I. INTER-CHIP COMMUNICATION 

Brief Background

As explained in the background section above, FPGA chips are used in some prior art
verification systems. However, FPGA chips are limited in the number of pins. If a single chip is
used, this is not a major problem. But when multiple chips are used to model any portion of
the user design for emulation purposes, some scheme must be used to allow these multiple
chips to communicate with each other. For the most part, prior verification systems utilize
dedicated hardware schemes (e.g., direct connections, cross-bar) or TDM schemes (e.g., virtual
wires technology). These prior art systems suffer from the high cost of providing dedicated
hardware resources (cross-bar) and low performance due to necessary extra cycles (virtual wires).
A more detailed explanation was provided in the Background of the Invention section of this
patent application above.

General Overview 

In accordance with one embodiment of the present invention, an inter-chip communication
system is provided which saves hardware costs while approaching the performance gains of the
dedicated direct connection scheme. In this scheme, only those data that changed in value are
transferred, thus saving cycles. Unlike TDM schemes, no cycles are wasted to transfer data that
did not change value.
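
The cycle savings claimed above can be made concrete with a small Python sketch. The cost model is an assumption made purely for illustration: each transferred signal group costs one packet cycle, a TDM scheme sends every group at every step, and the event-driven scheme sends only groups whose value changed. The function names are hypothetical.

```python
def tdm_packet_cycles(history):
    """Assumed TDM cost: every signal group is sent at every evaluation step."""
    return sum(len(step) for step in history)


def event_packet_cycles(history, initial):
    """Assumed event-driven cost: only groups that changed value are sent."""
    previous = list(initial)
    cycles = 0
    for step in history:
        cycles += sum(1 for old, new in zip(previous, step) if old != new)
        previous = list(step)
    return cycles


# Three signal groups observed over three evaluation steps (made-up values).
history = [(1, 0, 0), (1, 0, 0), (1, 2, 3)]
print(tdm_packet_cycles(history), event_packet_cycles(history, (0, 0, 0)))
```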

To fully describe the inter-chip communication system in accordance with one
embodiment of the present invention, imagine two FPGA chips such as chips 1565 and 1566 in
FIG. 39. These chips correspond to chips FPGA0 and FPGA2 in board6 at the top of the figure.
Note that these chips are provided in the RCC hardware accelerator portion of the verification
system for the modeling of the user design in hardware. Although these particular chips 1565
and 1566 are co-located on the same board, the inter-chip communication system is also
applicable to chips located on different boards.

The portion of the user design that is modeled in each chip is coupled to an inter-chip
communication logic, which includes both a transmission logic and a reception logic. The
portion of the user design that is coupled to the inter-chip communication logic includes separated
connections for the delivery of data. Typically, these separated connections represent the
boundaries of the user design that have been separated due to the memory constraints of the
FPGA chips. For example, assume that a user design is so large and complicated that a single
FPGA chip is not large enough to model this user design in hardware. In fact, assume that two
chips are necessary to adequately model this user design. So, this user design must be divided
into two portions - one portion in one chip and the other portion in the second chip. The
part where these two portions are separated represents the boundary. Separated connections are
provided at these two portions at the boundaries where data needs to be communicated between
these two portions. The inter-chip communication logic is coupled to these various separated
connections for the delivery and reception of data to and from other chips.

The logic circuitry on these two exemplary chips is shown in FIGS. 98A and 98B. FIG.
98A shows the transmission side in one chip while FIG. 98B shows the reception side in another
chip. Of course, the transmission circuit of FIG. 98A is also found in the chip associated with
FIG. 98B when the chip of FIG. 98B needs to transfer data to the chip associated with FIG. 98A.
In this case, the chip associated with FIG. 98A also includes reception circuitry, one
embodiment of which is found in FIG. 98B.

When any data that reaches the inter-chip communication system changes in value, the
inter-chip communication logic detects this event change and proceeds to schedule a time when
this changed data can be transmitted to the designated chip. Two key components of this logic
circuitry are the event detector and the packet scheduler. An exemplary event detector is item
3030 and an exemplary packet scheduler is item 3036 in FIG. 98A. With these and other logic
components, one chip is able to deliver data to another chip whenever any change in data values
is detected.

As mentioned above, the separated connections are coupled to the inter-chip
communication logic. When any change in value in the data at these separated connections is
detected by the event detector, the inter-chip communication logic proceeds to schedule the
delivery of these changed data to the other chip.

The delivery of the data from one chip to another is accomplished through packets. A
packet includes a header and one or more payload data (or signal values representing the data that
changed). More will be discussed below on the use of the header and payload information in the
packets.

Once the event detector detects an "event" (change in values), the packet scheduler gets
involved. In one embodiment, the packet scheduler uses one form of a token ring method to
deliver the data across the chip boundaries. When the packet scheduler receives a token and
detects an event, the packet scheduler "grabs" the token and schedules the transmission of this
packet in the next packet cycle. If, however, the packet scheduler receives the token but does not
detect an event, it will pass the token to the next packet scheduler. At the end of each packet
cycle, the packet scheduler that grabbed the token will pass the token to the next logic associated
with another packet.

With this implementation, the packet scheduler skips idle packets (i.e., those signal groups
which did not change in value) and prevents them from being delivered to another chip. Also,
this scheme guarantees that all event packets have a fair chance to be delivered to the other
designated chip.

Chip Boundaries and Limitations

Returning to FIGS. 98A and 98B and the illustrative example of the two chips used to
model the user design, the right side of FIG. 98A shows the chip boundary for the first chip
which includes the transmission logic shown therein, while the left side of FIG. 98B shows the
chip boundary for the second chip which includes the reception logic shown therein. This is the
separation that was made by the RCC system during the automatic component type and
hardware/software modeling steps early on, which was described in another section of this patent
application. The separated connections associated with both the left and right side of this
boundary can number in the hundreds. After all, an otherwise single user design was split up
into two portions just because the FPGA chip is not large enough in capacity to hold the hardware
model of that user design. Depending on where the split was made, possibly hundreds to
thousands of connections connecting these two split portions of the hardware model of the user
design were "broken up," so to speak. Because data is processed or passed from one portion of
the hardware model (in the first chip) to another portion of the hardware model (in the second
chip), and vice versa, a communication mechanism is needed to transport these data back and
forth.

As explained above, a limited number of pin-outs are provided in each FPGA chip. In
this example, assume that only two (2) pins are dedicated for inter-chip communication. These
two pins are shown as connection 3075 in both FIGS. 98A and 98B. Despite the use of a single
item number (i.e., 3075), this connection represents two wires or pin-outs. In other words, only
two pins are used to transport data between the first chip associated with FIG. 98A and the
second chip associated with FIG. 98B in this example.

With the event detection, packet scheduling, and transmission using the token ring
scheme, such communication between these two chips is possible across two wires even though
the number of separated connections may number in the hundreds or thousands.

Transmission Logic - Signal Groups

Referring now to FIG. 98A, the transmission logic will now be described with respect to
the two-chip example introduced above. Based on where and how the hardware model of the
user design was "separated" into two portions into the two chips, separated connections must
now be handled. These separated connections exist because the hardware model of the user
design was separated at that area. In this example, assume the separated connections are
represented by three signal groups S0, S1, and S2. Signal group S0 is represented by reference
number 3050, signal group S1 is represented by reference number 3051, and signal group S2 is
represented by reference number 3052.

The size of these signal groups can vary depending on how the hardware model of the
user design was split up in those two chips. In one embodiment, each signal group is 16 bits
wide. But because the chip only has two pin-outs for inter-chip communication, only two bits
can be transmitted at any given time. For this particular example, however, assume that each
signal group is 8 bits wide.

Each signal group can be identified by a header. The header data is represented by h0
(reference number 3053), h1 (reference number 3054), and h2 (reference number 3055). This
header information will be transmitted with the data in the signal groups so that the reception
logic in the second chip can route the signal group data to the appropriate section of the hardware
model placed in the second chip.

Packets 

The delivery of the data from one chip to another is accomplished through packets. A
packet includes a header and one or more payload data (or signal values representing the data that
changed). Depending on the hardware model of the user design and how it was divided up into
the multiple chips during place-and-route, the size of the packets may vary. In the example used
in this patent application, the packet is 10 bits long (2 bits for the header and 8 bits for the
payload data).

As discussed below, the number of bits that are transmitted across a chip boundary
depends on the number of pinouts dedicated for inter-chip communication. For example, if two
pinouts are dedicated for this type of communication, only two bits are transmitted at a time.
Thus, for a 10-bit packet, 5 scanout cycles are needed to deliver the entire 10 bits across to the
other chip.
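
The packet arithmetic above can be sketched in Python as follows. The helper names (`to_bits`, `build_packet`, `scanout_groups`, `scanout_cycles`) are hypothetical, and the most-significant-bit-first ordering within the header and payload is an assumption; the description only fixes the 2-bit header, the 8-bit payload, and the two pinouts.

```python
import math

HEADER_BITS = 2    # from the example: 2-bit header
PAYLOAD_BITS = 8   # from the example: 8-bit signal group
PINS = 2           # from the example: two pinouts for inter-chip communication


def to_bits(value, width):
    """Expand an integer into a list of bits, most significant bit first (assumed order)."""
    return [(value >> (width - 1 - i)) & 1 for i in range(width)]


def build_packet(header, payload):
    """Header occupies bit positions [0:1]; the payload occupies [2:9]."""
    return to_bits(header, HEADER_BITS) + to_bits(payload, PAYLOAD_BITS)


def scanout_groups(packet, pins=PINS):
    """Split the packet into the data groups sent in successive scanout cycles."""
    return [packet[i:i + pins] for i in range(0, len(packet), pins)]


def scanout_cycles(packet_bits, pins):
    """Number of scanout cycles needed to move one packet across the boundary."""
    return math.ceil(packet_bits / pins)
```

With these values, `scanout_cycles(HEADER_BITS + PAYLOAD_BITS, PINS)` evaluates to 5, in line with the five scanout cycles stated in the text.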

Transmission Logic - Event Detector

The transmission logic in this example includes three event detectors 3030-3032
corresponding to the three signal groups 3050-3052, respectively. These event detectors are
coupled to the separated connections associated with signal groups 3050-3052. For example,
event detector 3030 is coupled to signal group 3050 (S0). The purpose of each event detector is to
detect "events," or changes in the values, of data associated with its respective signal group.

The event detector is not coupled to the connections associated with the headers 3053-3055.
In one embodiment, since headers are merely identifiers for signal groups, the header
information does not change. In other embodiments, header information changes and the
transmission and reception logic handles the changes accordingly.

Each event detector is coupled to a packet scanout logic and a packet scheduler. In this
example, event detector 3030 is coupled to packet scanout 3033 and packet scheduler 3036 via
line 3062. Event detector 3031 is coupled to packet scanout 3034 and packet scheduler 3037 via
line 3063. Event detector 3032 is coupled to packet scanout 3035 and packet scheduler 3038 via
line 3064.

Each event detector provides its data from its corresponding signal group to the packet
scanout logic. Since only two bits (because of the two wire pinouts on the outside of the chip) can
be transmitted at a time, the packet scanout makes sure that two bits of the signal group from its
respective event detector are scanned out to the packet selector. The packet scanout logic and the
packet selector will be discussed below.

Also, each event detector is coupled to its corresponding packet scheduler as mentioned
above. When the event detector detects an "event," the packet scheduler is alerted that its signal
group has experienced a change in data value. The packet scheduler will be discussed below.

A more detailed view of an event detector is shown in FIG. 97. The event detector 3000
includes inputs from its corresponding signal group 3010 into an XOR network 3002. As known
to those skilled in the art, an XOR gate provides a logic "1" output when an odd number of its
inputs are at logic "1" and provides a logic "0" output when an even number of its inputs are at
logic "1." Thus, given any combination of inputs into the XOR network 3002, any change in the
input results in some change in the output due to the even-odd change of inputs.

The XOR network 3002 provides an output 3011 to an input port of XOR gate 3004. The
XOR network 3002 also provides the same output 3012 to a D flip-flop 3003, which receives a clock
input CLK at line 3013. The output of the D flip-flop 3003 is provided to the second input 3014
of XOR gate 3004. In essence, the XOR gate 3004 outputs a logic "1" at line 3016 when any
change occurs in the inputs at 3010. This logic "1" signal to the packet scheduler 3001 is the trigger
indicator to alert the packet scheduler 3001 that an event has occurred. The packet scheduler
3001 will be discussed in greater detail below.
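
The behavior of the event detector of FIG. 97 can be approximated in Python as below. This is a hedged sketch: the class name `EventDetector` and its methods are hypothetical, and the XOR network 3002 is assumed to reduce the signal group to a single parity bit, since the description does not detail its internal structure.

```python
from functools import reduce
from operator import xor


class EventDetector:
    """Hypothetical software model of event detector 3000."""

    def __init__(self):
        self.ff = 0  # state of D flip-flop 3003 (previously captured XOR output)

    @staticmethod
    def xor_network(bits):
        # Assumed structure: parity of all inputs (1 when an odd number are 1).
        return reduce(xor, bits, 0)

    def event(self, bits):
        """Output on line 3016: current XOR output compared with the stored one."""
        return self.xor_network(bits) ^ self.ff

    def clock(self, bits):
        """CLK edge on line 3013: flip-flop 3003 captures the current XOR output."""
        self.ff = self.xor_network(bits)
```

Under the parity assumption, the sketch flags any change that flips an odd number of bits between clock edges; a change that flips an even number of bits at once leaves the parity, and therefore the modeled output, unchanged.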

Note in FIG. 98A that the input signal groups are also provided to the packet scanout unit.
These details will be apparent to those ordinarily skilled in the art and are not shown in FIG.
97. No further explanation is necessary.

Transmission Logic - Packet Scanout

A packet scanout logic is provided to scan out the appropriate number of data groups
within a signal group. In this example, the number of pinouts is 2, so the 8-bit signal group (and
the 2-bit header) is divided up into 2-bit data groups since the transmission logic is designed to
transmit 2 bits to the reception logic in the other chip due to the 2 pinouts. Thus, 5 scanout
cycles are needed to transmit the entire 10-bit packet (signal group and header). The header
[0:1] is sent first, then the next two bits [2:3], then the next two bits [4:5], then the next two
bits [6:7], and finally the last two bits [8:9].

A packet scanout logic is provided for each of the signal groups. In this example, three
packet scanout logic 3033-3035 are provided to support the three signal groups 3050-3052 in
FIG. 98A. Each packet scanout logic receives the header information, the signal group data from
the event detector, and the scan pointer. In this example, packet scanout 3033 receives header
information 3053, signal group data 3050 from event detector 3030, and scan pointer control data
3056 from Out Scan Pointer logic 3044. Packet scanout 3034 receives header information 3054,
signal group data 3051 from event detector 3031, and scan pointer control data 3057 from Out
Scan Pointer logic 3044. Packet scanout 3035 receives header information 3055, signal group
data 3052 from event detector 3032, and scan pointer control data 3058 from Out Scan Pointer
logic 3044.

The Out Scan Pointer 3044 is coupled to each of the packet scanout logic 3033-3035 via
lines 3056-3058. An activation logic is provided in each of the packet scanout logic and a
periodic control logic is provided in the Out Scan Pointer 3044 for each of the 2-bit groups -
[0:1], [2:3], [4:5], [6:7], and [8:9]. The periodic control logic is coupled to the activation logic
in each of the packet scanout logic to activate each of the 2-bit groups in succession: first
[0:1], then [2:3], then [4:5], then [6:7], then [8:9], and finally back to
[0:1], where the cycle repeats all over again. The same 2-bit group for all of the signal groups in
all the packet scanout logic 3033-3035 is activated simultaneously. Thus, the [0:1] data
group in all of the packet scanout logic 3033-3035 is activated simultaneously while the other data
groups are not activated. Next, the [2:3] data group in all of the packet scanout logic 3033-3035
is activated simultaneously while all other data groups are not activated, and so forth.

In one embodiment, the activation logic in each packet scanout logic is a simple AND gate
where one input is the data input and the other input is a control input which receives a logic "1"
from the periodic control logic for some time period and a logic "0" for another time period.
For this example of a 10-bit packet, the periodic control logic outputs a logic "1" to the control
input of the AND gate once every 5 cycles for each of the data groups. So for one cycle, data
group [0:1] in all of the packet scanout logic is activated while all other data groups are not
activated. In the next cycle, data group [2:3] in all of the packet scanout logic is activated while
all other data groups are not activated. This cycle continues for data groups [4:5], [6:7], and
[8:9].

Because the Out Scan Pointer 3044 is actually activating the same set of data groups (e.g.,
[2:3]) in all of the packet scanout logic for all signal groups 3050-3052, theoretically all of these
activated data groups can be transmitted out to the next chip. But in this example, because only 2
pinouts are available, additional logic is needed to select the particular signal group ([0:9],
including the header), and hence the particular activated data group (e.g., [2:3]), that will be
scanned out on those two pinouts in that packet cycle.
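
The interplay between the scan pointer, the AND-gate activation, and the need for a selector can be sketched in Python. The function names are hypothetical, packets are modeled as lists of ten bits, and combining the gated bits with an OR is an assumption made so that each packet scanout yields a single 2-bit output; the description specifies only the AND-gate activation.

```python
GROUPS = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]  # the five 2-bit data groups


def scan_pointer(cycle):
    """Periodic control: exactly one data group is enabled in each scanout cycle."""
    active = cycle % len(GROUPS)
    return [1 if i == active else 0 for i in range(len(GROUPS))]


def packet_scanout(packet, control):
    """AND every data bit with its group's control bit; OR the results (assumed)."""
    out = [0, 0]
    for (first, second), enable in zip(GROUPS, control):
        out[0] |= packet[first] & enable
        out[1] |= packet[second] & enable
    return out


def packet_selector(scanout_outputs, token_holder):
    """Only the token holder's activated data group reaches the two pinouts."""
    return scanout_outputs[token_holder]
```

Every packet scanout presents the same data group in a given cycle, so `packet_selector` stands in for the additional logic that chooses which signal group actually drives the pinouts.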

Transmission Logic - Packet Scheduler

In one embodiment, the packet scheduler uses a form of token ring technology to deliver
the packets from one chip to another. Generally speaking, when a packet scheduler associated
with a particular signal group receives a token and detects an event, the packet scheduler "grabs"
the token and schedules the transmission of this packet in the next packet cycle. If, however, the
packet scheduler receives the token but does not detect an event, it will pass the token to the next
packet scheduler associated with another signal group. At the end of each packet cycle, the
packet scheduler that grabbed the token will pass the token to the next packet scheduler associated
with another packet.

With this implementation, the packet scheduler skips idle packets (i.e., those signal groups
which did not change in value) and prevents them from being delivered to another chip. Also,
this scheme guarantees that all event packets have a fair chance to be delivered to the other
designated chip.

Each packet scheduler receives an event input from its corresponding event detector and
another input from the Out Scan Pointer 3044. Each packet scheduler is coupled to another
adjacent packet scheduler so that all the packet schedulers are tied together in a circular loop
configuration. Finally, each packet scheduler outputs a control output to a packet selector.

In this example, packet scheduler 3036 receives an event input from event detector 3030
via line 3062 and a scan pointer input from Out Scan Pointer 3044 via line 3065. Packet
scheduler 3037 receives an event input from event detector 3031 via line 3063 and a scan pointer
input from Out Scan Pointer 3044 via line 3066. Packet scheduler 3038 receives an event input
from event detector 3032 via line 3064 and a scan pointer input from Out Scan Pointer 3044 via
line 3067. With these inputs, each packet scheduler knows whether its corresponding event
detector has detected an event and which of the 2-bit data groups is currently active.

The packet schedulers collectively are also tied together in a circular loop configuration
for token ring passing. Packet scheduler 3036 is coupled to packet scheduler 3037 via line 3068,
packet scheduler 3037 is coupled to packet scheduler 3038 via line 3069, and packet scheduler
3038 is coupled to packet scheduler 3036 via line 3070. Thus, when a packet scheduler
associated with a particular signal group receives a token and receives an event input from its
corresponding event detector, the packet scheduler "grabs" the token and schedules the
transmission of this packet in the next packet cycle. If, however, the packet scheduler receives
the token but does not receive an event input from its corresponding event detector, it will pass
the token to the next packet scheduler associated with another signal group. A packet scheduler
will only "grab" the token if it has also received an event input from its corresponding event
detector. If there is no event, the packet scheduler will not "grab" the token; it will pass it on to
the next packet scheduler. At the end of each packet cycle, the packet scheduler that grabbed the
token will pass the token to the next packet scheduler associated with another packet.
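
The token passing described above can be illustrated at packet-cycle granularity with a small Python sketch. The function `schedule` is hypothetical, and the sketch assumes that no new events arrive while it runs and that a scheduler's pending event is cleared once its packet has been scheduled.

```python
def schedule(pending, start=0, packet_cycles=4):
    """Return, for each packet cycle, the index of the scheduler that transmits.

    pending: one boolean per packet scheduler (True = event detected).
    start:   scheduler that is offered the token first.
    None in the result marks a packet cycle in which no scheduler had an event.
    """
    n = len(pending)
    pending = list(pending)
    token = start
    sent = []
    for _ in range(packet_cycles):
        grabbed = None
        for hop in range(n):                 # the token travels around the ring
            candidate = (token + hop) % n
            if pending[candidate]:           # token received AND event detected
                grabbed = candidate          # this scheduler "grabs" the token
                break                        # schedulers without events pass it on
        sent.append(grabbed)
        if grabbed is not None:
            pending[grabbed] = False         # assumed: event cleared once scheduled
            token = (grabbed + 1) % n        # end of packet cycle: pass token onward
    return sent
```

For example, with events pending on schedulers 0 and 2 only, the idle scheduler 1 is skipped, and when all three have events each one transmits in turn, which reflects the fairness property noted in the text.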

Each packet scheduler 3036-3038 also outputs a control output 3071-3073 to the packet
selector 3039. This control output dictates which of the packets among the signal groups have
been selected for transmission across the chip's pinouts.

How long does a packet scheduler grab the token before passing it to the next packet
scheduler? The packet scheduler needs to grab the token for as long as necessary to transmit an
entire packet. This implies that the packet scheduler must keep track of whether an entire cycle
of data groups comprising the packet has been scanned out or not. How? Each packet scheduler
receives information about the scanout pointers. Packet scheduler 3036 receives scanout pointer
information via line 3065, packet scheduler 3037 receives scanout pointer information via line
3066, and packet scheduler 3038 receives scanout pointer information via line 3067.

When a packet scheduler grabs a token, it notes the information from the scanout pointer
to determine which data group has been activated for scanout. As the Out Scan Pointer activates
data groups in succession (i.e., [0:1], [2:3], [4:5], [6:7], and [8:9]), the packet scheduler notes
this scanout pointer information. When the packet scheduler notes that a full cycle of data
groups has been activated (and hence, the entire packet has been transmitted), the packet
scheduler releases the token to the next packet scheduler. Remembering the particular data group
at the time it grabbed the token allows the packet scheduler to determine whether a full cycle has
passed.

A more detailed view of the packet scheduler is shown in FIG. 97. Packet scheduler 3001
receives the event detection indication from the event detector 3000 via line 3016. A D flip-flop
3005 is provided which receives the event detection indication as the CLK input. Its D input is
tied to a logic "1" source such as Vcc via line 3015. The output of the D flip-flop 3005 is
provided to the token algorithm unit 3007 via line 3017. This output on line 3017 represents the
event detection indicator. The value of this indicator is a logic "1" when the packet scheduler
detects an event. The D flip-flop 3005 receives its reset input from the token algorithm unit 3007
via line 3018. So long as a packet is being delivered, the event detection indicator on line 3017
should output a logic "1" to the packet scheduler 3001.

The D flip-flop 3006 is used to indicate whether its associated packet scheduler 3001 is
the current token holder or not. D flip-flop 3006 receives an input from the token algorithm unit
3007 via line 3024, an enable input from the scan pointer 3008 via line 3019, and a clock input
via line 3023. The enable input on line 3019 is also the ScanEnd signal. The ScanEnd signal
represents whether or not the last data group in the packet has been sent. Thus, if the last data
group in the packet has been sent out, then ScanEnd = logic "1." The D flip-flop 3006 outputs a
Tk output on line 3026 and another output to the token algorithm via line 3025. Tk represents
the current token value. If Tk = logic "1," then this packet scheduler is the current token holder;
otherwise, Tk = logic "0."
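
The two flip-flops of the packet scheduler can be modeled behaviorally in Python as follows. The class names `EventLatch` and `TokenRegister` are hypothetical, and clock edges are reduced to method calls; this is a simplified sketch of the described behavior rather than a timing-accurate model.

```python
class EventLatch:
    """Sketch of D flip-flop 3005: D tied to logic 1, clocked by the event pulse."""

    def __init__(self):
        self.q = 0            # event detection indicator on line 3017

    def event_pulse(self):
        self.q = 1            # an edge on the CLK input captures D = 1

    def reset(self):
        self.q = 0            # reset driven by the token algorithm unit (line 3018)


class TokenRegister:
    """Sketch of D flip-flop 3006: loads its input only when ScanEnd enables it."""

    def __init__(self, tk=0):
        self.tk = tk          # Tk: 1 when this scheduler is the current token holder

    def clock(self, d, scan_end):
        if scan_end:          # enable input on line 3019 (ScanEnd)
            self.tk = d       # capture the value supplied on line 3024
        return self.tk
```

Because the token register updates only when ScanEnd is asserted, the token holder in this sketch cannot change in the middle of a packet.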

The token algorithm unit 3007 receives an input from the D flip-flop 3005 via line 3017, a
Tki input on line 3021, a ScanStart input from the scan pointer 3008 via line 3020, and the output
of D flip-flop 3006 via line 3025. The token algorithm unit 3007 outputs the reset signal to D
flip-flop 3005 via line 3018, the Tko signal on line 3022, and the input to the D flip-flop 3006 via
line 3024.

The token algorithm unit essentially answers these questions: Who is the current token
holder? Who is the next token holder? Should I be the token holder if the token comes my way?
Should I pass the token to another? The token algorithm is as follows:

R=ScanStart&Tk

Tkn=Tki&Ev+Tk&Tki

Tko=Tk+Tki&!Ev

ScanStart is at logic "1" when the header has been sent, and logic "0" otherwise.
ScanStart is delivered by the scan pointer 3008. Certain bit groups at the beginning of a packet are
designated for the header, and the scan pointer logic can deliver this information to the token
algorithm unit 3007.

ScanEnd is at logic "1" if the last data group in the packet was sent out, and logic "0"
otherwise. Together, ScanStart and ScanEnd represent the beginning and end of transmission of
the packet.

"Tki" represents an input token. The packet scheduler is receiving a token from another
packet scheduler.

"Tko" represents an output token. The packet scheduler is passing this token to another
packet scheduler.

"Tk" indicates whether any given packet scheduler holds the current token. This Tk
value is communicated to the packet selector 3039 (see FIG. 98A) as the control signal in
determining which signal group to select for scan out. When Tk = logic "1," the corresponding
packet scheduler is the current token holder.

"Tkn" represents the next token. If Tkn is at logic "1," the corresponding packet
scheduler is the next token holder.

"Ev" represents an indication that an event has been detected. "!Ev" represents an
indication that an event has not been detected.

The "R=ScanStart&Tk" portion of the token algorithm guarantees that flip-flop 3005 will
be reset so that the output 3017 will show a logic "1." This is necessary because the packet
scheduler, and hence the signal group, that grabbed the token needs to reset the event detector
flip-flop 3005 before sending the packet out. If it does not reset the flip-flop, it will attempt to
grab the token for the next packet cycle. How is this accomplished? Because the header was
sent, ScanStart = logic "1." Tk = 1 also because the packet scheduler is the current token holder.
Thus, R = 1, which resets the flip-flop 3005.
The "Tkn=Tki&Ev+Tk&Tki" portion of the token algorithm attempts to determine who
the next token holder is. If the given packet scheduler is receiving a token (Tki = 1) AND an
event has been detected, then that packet scheduler is the next token holder. This is embodied by
the "Tkn=Tki&Ev" portion of the Tkn token algorithm.

In addition, if the given packet scheduler is also the current token holder and it is also
receiving the token (because no other packet scheduler wants the token), then this packet
scheduler will continue to be the token holder. It is also the "next" token holder. This is
embodied by the "Tkn=Tk&Tki" portion of the Tkn token algorithm.
The "Tko=Tk+Tki&!Ev" portion attempts to determine whether the given packet scheduler
should pass the token to the next packet scheduler. First and foremost, the given packet
scheduler cannot pass a token to another if it does not have the token. Thus, if the given packet
scheduler is the current token holder, it will also output the token to another packet scheduler.
This is embodied by the "Tko=Tk" portion of the Tko token algorithm.

In addition, if the given packet scheduler is receiving a token from another but it has not
detected an event, then this packet scheduler does not need the token and should pass it to another
packet scheduler. This is embodied by the "Tko=Tki&!Ev" portion of the Tko token algorithm.
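
By way of illustration only, the three equations of the token algorithm can be evaluated in software as in the following Python sketch. The names `token_algorithm` and `TokenOutputs` are hypothetical and do not appear in the figures. The sketch assumes that "+" denotes logical OR, "&" denotes logical AND, and "!" denotes logical NOT, and it models only the combinational equations, not the flip-flops 3005 and 3006.

```python
from dataclasses import dataclass


@dataclass
class TokenOutputs:
    r: bool    # R: reset for the event detection flip-flop
    tkn: bool  # Tkn: next token value
    tko: bool  # Tko: token passed on to another packet scheduler


def token_algorithm(scan_start: bool, tk: bool, tki: bool, ev: bool) -> TokenOutputs:
    """Evaluate R, Tkn and Tko for one packet scheduler (illustrative model)."""
    r = scan_start and tk                # R = ScanStart & Tk
    tkn = (tki and ev) or (tk and tki)   # Tkn = Tki & Ev + Tk & Tki
    tko = tk or (tki and not ev)         # Tko = Tk + Tki & !Ev
    return TokenOutputs(r=r, tkn=tkn, tko=tko)
```

For example, a packet scheduler that receives a token (Tki = 1) while an event is pending (Ev = 1) and that is not yet the token holder obtains Tkn = 1 and Tko = 0 in this model, so it keeps the token rather than forwarding it.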

Transmission Logic - Packet Selector 
The packet selector serves as one large multiplexer: it receives packet data at its data
inputs and control inputs from the packet schedulers, and selects which packet's data to
output across the chip's pinouts. The packet selector 3039 receives the packet data via
lines 3059-3061 and control input from each of the packet schedulers 3036-3038. Thus, packet
selector 3039 receives packet data from packet scanout 3033 via line 3059 and its corresponding
control input 3071 from packet scheduler 3036. Packet selector 3039 receives packet data from
packet scanout 3034 via line 3060 and its corresponding control input 3072 from packet scheduler
3037. Packet selector 3039 receives packet data from packet scanout 3035 via line 3061 and its
corresponding control input 3073 from packet scheduler 3038.

Based on the packet scheduler's own algorithm of determining whether an event has been
detected and whether it has received a token, the packet scheduler outputs control data to the
packet selector 3039. If packet scheduler 3036 has received an event detection indication from
the event detector 3062 and has received a token, the packet scheduler 3036 grabs the token and
outputs control output to the packet selector 3039 via line 3071. This alerts the packet selector
3039 to select the data on line 3059 for output across the chip's pinouts. Just as control 3071 is
associated with packet data on line 3059, control 3072 is associated with packet data on line 3060
and control 3073 is associated with packet data on line 3061.

The packet scheduler that has grabbed the token will make sure to keep its control output
to the packet selector active until every data group in the packet has been scanned out
and transmitted across the chip's pinouts. Using pinouts 3075, the packet scheduler outputs the
packet, data group by data group. Here, the packet is represented by reference number 3074,
where a header and four data groups are shown. In this example, each data group is 2 bits since
there are only 2 pinouts. The header is output first, followed by each of the 2-bit groups that has
been scanned out by the Out Scan Pointer 3044.
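
The division of a packet into data groups described above can be sketched in Python as follows. This is a minimal software illustration, not the hardware scan-out logic; the function name `split_into_data_groups` and the representation of bits as a list of integers are assumptions made for the example.

```python
def split_into_data_groups(header_bits, signal_bits, pin_count):
    """Return the packet as a list of data groups, header first.

    Each data group holds exactly `pin_count` bits, i.e. the number of bits
    that can cross the chip boundary in one scan-out cycle.
    """
    packet = list(header_bits) + list(signal_bits)
    if len(packet) % pin_count != 0:
        raise ValueError("packet length must be a multiple of the pin count")
    return [packet[i:i + pin_count] for i in range(0, len(packet), pin_count)]
```

With a 2-bit header, an 8-bit signal group, and 2 pinouts, the function returns five 2-bit groups, corresponding to the header and the four data groups of packet 3074.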

Transmission Timing 

In one embodiment of the present invention, the transmission of a selected N-bit signal
group (through token passing) via the plurality of M-bit data groups occurs during one evaluation
(i.e., EVAL period) cycle. The scan0 pointer for the header is enabled for one clock period.
Then, the EVAL period begins where each successive M-bit data group is transmitted during
each successive clock cycle. During this EVAL period, the Tkn value is calculated to determine
the next token holder. At the conclusion of the scan-out of the last scanned M-bit data group
(e.g., [8:9] in the example above), the EVAL period will terminate. At this point, the token
values among the packet schedulers will be updated.

Reception Logic — Overview 
The purpose of the reception logic is to receive the packets and distribute the packet data
to their designated connections in the hardware model realized in this particular chip. Once the
packet data reaches their destination, the data can be processed by the hardware model. The
entire movement of data from one chip to another chip allows the hardware model to process the
data as if no separation occurred due to the memory limitations of FPGA chips. While the
transmission logic scans out the data 2 bits at a time from the first chip, the reception logic
receives and scans in the data 2 bits at a time to the appropriate separated connections in the
second chip.

Referring now to FIG. 98B, the chip boundary is shown on the left side of the figure.
Once again, using the same example as above, this chip has only 2 pinouts 3075 dedicated for
inter-chip communication. Line 3075 branches into lines 3076-3079. Line 3076 routes header
data to a header decode unit 3040. Lines 3077-3079 route data groups to packet scan-in units
3041-3043. Depending on which data group has been activated for scan-in, the data groups are
scanned in one by one until the entire packet has been delivered.

Reception Logic - Header Decode

The header decode unit 3040 makes sure that the packets are delivered to the appropriate
packet scan-in units. For example, packets from signal group S0 on the transmission side should
end up at signal group S0 on the reception side; that is, the signals from the separated
connections on one chip should be delivered to the corresponding separated connections on the
other chip.

The header decode unit 3040 receives header information via line 3076. Line 3076
branches off from line 3075 which contains all the data groups that have been received in the
chip. The header decode unit also receives all the data groups but because the In Scan Pointer
3045 in the reception logic of this second chip is synchronized with the Out Scan Pointer 3044 in
the transmission logic of the first chip (see FIG. 98A), the header decode unit knows which data
group is the header and which are payload data groups. Note that the header decode unit 3040
receives scan pointer information from the In Scan Pointer 3045 via line 3089.

When the header decode unit 3040 captures the header for this received packet, it decodes
the header information and now knows which signal group (e.g., S0, S1, S2) this packet belongs
to. The header decode unit 3040 outputs control signals to the packet scan-in units 3041-3043 via
lines 3086-3088, respectively. If the packet belongs to signal group S0, the header decode unit
3040 will enable packet scan-in unit 3041 via line 3086. If the packet belongs to signal group S1,
the header decode unit 3040 will enable packet scan-in unit 3042 via line 3087. If the packet
belongs to signal group S2, the header decode unit 3040 will enable packet scan-in unit 3043 via
line 3088.

Reception Logic - Packet Scan-In Unit 
The packet scan-in unit in the reception logic works analogously to the packet scan-out
unit in the transmission logic. A packet scan-in unit is provided to scan in the appropriate number
of data groups within a signal group. In this example, the number of pinouts is 2, so the 8-bit
signal group (and the 2-bit header) is divided up into 2-bit data groups since the reception logic is
designed to receive 2 bits from the transmission logic in the other chip due to the 2 pinouts.
Thus, 5 scan-in cycles are needed to receive the entire 10-bit packet (signal group and header):
first the header [0:1], then the next two bits [2:3], then the next two bits [4:5], then the next two
bits [6:7], and finally the final two bits [8:9].

A packet scan-in unit is provided for each of the signal groups. In this example, three
packet scan-in units 3041-3043 are provided to support the three signal groups 3083-3084. Each
packet scan-in unit receives the header information, the data groups forming the packet from the
transmission logic in the other chip, a control signal from the header decode unit 3040, and a
scan pointer. In this example, packet scan-in 3041 receives data groups on line 3077, control
signals from the header decode unit 3040 on line 3086, and scan pointer control data 3080 from
In Scan Pointer logic 3045. Packet scan-in 3042 receives data groups on line 3078, control
signals from the header decode unit 3040 on line 3087, and scan pointer control data 3081 from
In Scan Pointer logic 3045.

The In Scan Pointer 3045 is coupled to each of the packet scan-in units 3041-3043 via
lines 3080-3082. An activation logic is provided in each of the packet scan-in units and a periodic
control logic is provided in the In Scan Pointer 3045 for each of the 2-bit groups - [0:1], [2:3],
[4:5], [6:7], and [8:9]. The periodic control logic is coupled to the activation logic in each of the
packet scan-in units to activate each of the 2-bit groups in succession: first [0:1], then
[2:3], then [4:5], then [6:7], then [8:9], and finally back to [0:1], where the
cycle repeats. The same 2-bit group for all of the signal groups in all the packet
scan-in units 3041-3043 is activated simultaneously. Thus, the [0:1] data group in all
of the packet scan-in units 3041-3043 is activated simultaneously while the other data groups are
not activated. Next, the [2:3] data group in all of the packet scan-in units 3041-3043 is activated
simultaneously while all other data groups are not activated, and so forth.

In one embodiment, the scan-in unit is implemented with flip-flops whose enable pins are
controlled by scan pointers. In the given example of 2 header bits and 8 data bits, the scan-in
unit comprises 8 flip-flops. The 1st and 2nd flip-flops are enabled by scan pointer 1, which
latches in bits [2:3]. The 3rd and 4th flip-flops are enabled by scan pointer 2, which latches in
bits [4:5]. The 5th and 6th flip-flops are enabled by scan pointer 3, which latches in bits
[6:7]. The 7th and 8th flip-flops are enabled by scan pointer 4, which latches in bits
[8:9]. Also, the header decode unit has two flip-flops which capture the header bits [0:1], enabled
by scan pointer 0.
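
The cooperation of the scan pointer, the header decode unit, and the scan-in flip-flops can be illustrated by the following Python sketch. It is a behavioral model under stated assumptions, not the disclosed circuit: the class name `ReceptionLogic` and its attributes are hypothetical, and the sketch assumes that the 2-bit header encodes the signal group index as a binary number with the most significant bit first, which the description above does not specify.

```python
class ReceptionLogic:
    """Behavioral model: scan pointer 0 captures the header, pointers 1..4 latch data."""

    def __init__(self, num_signal_groups=3, pin_count=2, signal_width=8):
        self.pin_count = pin_count
        # One pointer position for the header plus one per data group.
        self.num_pointers = 1 + signal_width // pin_count
        # One bank of "flip-flops" per packet scan-in unit.
        self.scan_in = [[0] * signal_width for _ in range(num_signal_groups)]
        self.pointer = 0
        self.enabled_unit = None

    def clock(self, data_group):
        """Process the data group present on the pins during one clock cycle."""
        if len(data_group) != self.pin_count:
            raise ValueError("data group width must equal the pin count")
        if self.pointer == 0:
            # Header decode (assumed encoding): header bits form a binary index.
            index = int("".join(str(bit) for bit in data_group), 2)
            if index >= len(self.scan_in):
                raise ValueError("header does not name a known signal group")
            self.enabled_unit = index
        else:
            # Scan pointer k enables the flip-flop pair for the k-th data group.
            start = (self.pointer - 1) * self.pin_count
            self.scan_in[self.enabled_unit][start:start + self.pin_count] = data_group
        # The pointer advances every cycle and wraps back to the header slot.
        self.pointer = (self.pointer + 1) % self.num_pointers
```

Feeding the model five successive 2-bit groups (a header followed by four data groups) fills the eight storage positions of the selected scan-in unit and returns the pointer to position 0, ready for the next packet.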

The In Scan Pointer 3045 is synchronized with the Out Scan Pointer 3044. Thus, when
data group [0:1] has been scanned out by the transmission logic in the first chip, the same data
group [0:1] has been scanned in by the reception logic in the second chip.


Inter-Chip Communication Logic - Summary 
The complexity of user designs, the limited capacity of FPGA chips, and the limited
number of chip pinouts have resulted in the development of inter-chip communication technology
that necessitates the transfer of a large amount of data across a limited number of pins in the
shortest amount of time. One embodiment of the present invention is an inter-chip
communication system that transfers signals across FPGA chip boundaries only when these
signals change values. Thus, no cycles are wasted and every event signal has a fair chance of
achieving communication across chip boundaries.

In one embodiment, the inter-chip communication system includes a series of event
detectors that detect changes in signal values and packet schedulers which can then schedule the
transfer of these changed signal values to another designated chip. Working with a plurality of
signal groups that represents signals at the separated connections, the event detector detects
events (or changes in signal values). When an event has been detected, the event detector alerts
the packet scheduler.

The packet scheduler employs a token ring scheme as follows. When the packet scheduler
receives a token and detects an event, the packet scheduler "grabs" the token and schedules the
transmission of this packet in the next packet cycle. If, however, the packet scheduler receives
the token but does not detect an event, it will pass the token to the next packet scheduler. At the
end of each packet cycle, the packet scheduler that grabbed the token will pass the token to the
next logic associated with another packet.

With this implementation, the packet scheduler skips idle packets (i.e., those signal groups
which did not change in value) and prevents them from being delivered to another chip. Also,
this scheme guarantees that all event packets have a fair chance to be delivered to the other
designated chip.
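
The ring-level effect of this scheme, in which idle signal groups are skipped and pending ones are served in turn, can be sketched behaviorally in Python. The sketch is an assumption-laden illustration rather than the gate-level logic: the function names `next_holder` and `run_cycles` are hypothetical, packet schedulers are assumed to be arranged in index order around the ring, and each inner list gives the event flags seen during one packet cycle.

```python
def next_holder(events, current):
    """Return the index of the next token holder.

    The token travels around the ring starting after `current`; the first
    scheduler with a pending event grabs it. If no scheduler has an event,
    the token returns to the current holder, which keeps it.
    """
    count = len(events)
    for step in range(1, count + 1):
        candidate = (current + step) % count
        if events[candidate]:
            return candidate
    return current


def run_cycles(event_trace, holder=0):
    """Return the order in which signal groups transmit over successive packet cycles."""
    sent = []
    for events in event_trace:
        holder = next_holder(events, holder)
        if events[holder]:
            sent.append(holder)
    return sent
```

In this model, when all three signal groups have events in every cycle, the token visits them in rotation, and when one group is idle it is simply passed over, so no packet cycle is spent on an unchanged signal group.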

Depending on the number of pinouts that are dedicated for inter-chip communication, scan-out
pointers are used on the transmission side and scan-in pointers are used on the reception side.
So, if only two wires are available across the chips' boundaries, then data groups of 2 bits are
scanned out in sequence until the entire packet has been transmitted. Because the scan-out logic
and scan-in logic are synchronized, both the transmission side and reception side
know which data group is being delivered across the chips' boundaries.

At the reception side, a header decode unit is provided to determine which signal group a 
packet belongs to. The header decode unit then ensures that the packet is delivered to the 
appropriate logic supporting that signal group. 


I. BEHAVIOR PROCESSOR SYSTEM 

In accordance with another embodiment of the present invention, a novel Behavior
Processor provides a unique architecture for implementing behavior applications, such as
monitors, triggers, and memory server. One embodiment of the present invention is a Behavior
Processor that is integrated with the RCC computing system (the host workstation containing the
software model of the system design) and the RCC hardware array (emulator containing the RTL
hardware model). With this configuration, behavioral aspects of the user's design and debug
session are implemented in hardware to accelerate the debug process. Whenever certain
conditions are satisfied as programmed into the Behavior Processor, a callback trigger signal is
generated and delivered to the workstation to alert the user and software model. In the past,
these behavioral aspects were implemented in software which provided a major bottleneck in the
design verification process.

BACKGROUND

A brief background will now be provided. A hardware-based language, such as VHDL,
serves as a description language of the input data to synthesis tools. In the context of software
tools and VHDL, synthesis is a method of converting a higher level abstraction (e.g., a
behavioral description) to a lower level abstraction (e.g., a gate-level netlist). Users can write
code for simulation and code for synthesis. When writing code for simulation, almost everything
is possible, from conditional constructs (e.g., wait, delay, while loops, for loops, if-then-else
statements) to simple calculations, since simulation is performed in software.
Most problems and issues that arose during the initial attempts at using synthesis tools
were caused by the restrictions to VHDL which make only a subset of VHDL elements available
for synthesis. The restrictions are based on the lack of a hardware-equivalent of a VHDL
element or the limited capabilities of the synthesis tool. In other words, the user is constrained
quite a bit when it comes to writing synthesis code.
Code for synthesis suggests that VHDL code is being written for placement of design
elements in some logic device such as a CPLD or an FPGA. Not all simulation code elements
can be reduced to synthesis code elements. Thus, VHDL elements that are adequate for
simulation can be useless for synthesis because of the lack of any corresponding hardware
equivalents for implementing them. These are, for example, specifications of signal delays which
depend on temperature, the fabrication process, and the supply voltage, and cannot be adjusted
because of the wide range of these parameters. Other examples are the initial values in the signal
or the variable declarations. After a power-on of a chip, the initial values of the data are random
ones. During synthesis, this information is ignored. Circuit specifications which are not
accepted by synthesis tools may cause an abort of the synthesis process. This will happen if
actions are specified to be triggered by the condition that the edges of two signals occur at the
same time. Because no technology library contains a flip-flop which is triggered on two
simultaneous edges of clock signals, this VHDL construct is not allowed for synthesis purposes,
although it may be allowed for simulation purposes.
That means that users often cannot insert WAIT statements, depend on both edges of an
event as a trigger, or insert other things that they could do if they were coding for simulation
only. Even where certain code styles are allowed, users will not necessarily be able to synthesize
a particular design in a compact way.

When code for synthesis has been generated, the user can use synthesis tools to reduce the
code into silicon. As suggested above, synthesis tools are programs that prepare VHDL code that
users have written for implementation in an FPGA device. These tools take either behavioral or
structural VHDL code, transform the code into FPGA primitives or "standard native
components" specific to the device, and ultimately yield a gate level netlist file which can be used
in an FPGA place-and-router burner. Along the way, several steps are involved including
compilation into a preliminary design compiler format, optimization for area or speed,
specification of constraints such as pre-assigned pin placements and delay targets, and final
extraction into a netlist file or into a "back-annotated" delay file.

To illustrate a conventional implementation, refer to FIG. 99 which shows a high level
debug environment. In FIG. 99, a workstation 3100 is interfaced to a hardware emulator 3103
which contains the RTL hardware model of the user's circuit design. A test bench process 3101
in software provides test bench data via line 3104a for the hardware model in the emulator 3103
to process. When a set of test bench data has been received and processed by the hardware
emulator 3103, the results of this processing must be checked for accuracy by comparing them to
expected results. Thus, the actual results are fed back to a software checker 3102 via line 3104b.
While the checker 3102 is checking the results of the hardware emulator's processed data, the
debug session is halted momentarily. If the results match favorably with expected results, then
the workstation can instruct the test bench 3101 to deliver more test bench data to the hardware
emulator. Based on the results of the hardware model's processing of these test bench data, users
can determine whether their circuit designs are "working" or not.

Because of the need to check results during the debug session, this type of set-up must
make frequent stops to allow the checker 3102 to receive and check the processed results.
Because the checker is in software, this set-up significantly slows down the debug session.
In addition to checking processed results, the checker 3102 also performs other conditional
operations including "While... Do" loops, "If... then... else" statements, "For" loops, and the like.
Time-based conditional instructions like WAIT, FORK, and DELAY are also conditional
operations. Although these behavioral or conditional instructions can be implemented in
hardware, it is very difficult to do so and the necessary hardware logic takes up a lot of space on
the FPGA chip, which should be reserved for user design modeling. However, these conditions
are easy to implement in software. So, the checker 3102 also includes these conditional
instructions. When the variables in these behavioral conditional loop instructions are processed
by the hardware emulator, the checker 3102 checks to see whether or not the specified conditions
have been satisfied to further perform other operations. As a result of these numerous
conditional instructions in software, the debug session speed is slow because the processing has to
be stopped at each iteration for the checker 3102 to check these conditions.

Furthermore, the set-up of FIG. 99 also includes the processing of other behavioral
instructions that are not loop-based but which are processed in software because the hardware
emulator 3103 has no place for them. Such instructions include $MONITOR, $DISPLAY, and
$PRINT. So, the user has to manually mask these instructions in software by prepending
a "#" character to each of them. Accordingly, the set-up of FIG. 99
will not attempt to send any data to the monitor or print any data on the printer when these
instructions are encountered. Such manual intervention by the user or the need for additional
"massaging" of the code provides a less than optimum environment for the user during a debug
session.
In short, traditional accelerators and emulators do not address "behavior" functions in
hardware and can only speed up synthesized RTL and gate-level netlist. One embodiment of the
present invention provides a system that generates hardware elements from normally
non-synthesizable code elements for placement on an FPGA device. This particular FPGA device is
called a Behavior Processor. This Behavior Processor executes in hardware those code constructs
that were previously executed in software. When some condition is satisfied (e.g., If... then... else
loop) which requires some intervention by the workstation or the software model, the Behavior
Processor works with an Xtrigger device to send a callback signal to the workstation for
immediate response.


HIGH LEVEL BEHAVIOR PROCESSOR SYSTEM 

FIG. 100 shows a high level co-modeling environment in accordance with one
embodiment of the present invention. A host workstation 3106 is coupled to the RCC hardware
accelerator 3107. This RCC hardware accelerator 3107 has been described in other sections of
this present patent specification. A board 3109 is coupled to the RCC hardware accelerator 3107.
This board 3109 contains a Behavior Processor 3109a and an internal memory 3109b. The
Behavior Processor 3109a is designed with Verilog RTL.

The Behavior Processor 3109a uses the program memory 3109b (also known as
internal memory) available inside the FPGA to execute code stored in the program space. The
program memory 3109b can be dynamically loaded during runtime given the RCC hardware
accelerator's ability to re-program the Behavior Processor during runtime. In one embodiment,
the size of the internal memory 3109b for Altera 10K250 is 40 Kbits.

Although FIG. 100 shows the RCC hardware accelerator 3107 separate from the board
3109, another embodiment provides for the realization of the board in the RCC hardware
accelerator 3107. By mapping the Behavior Processor in the FPGA hardware, the overall design
verification system set-up dramatically increases the performance of behavior functions that were
normally handled by a software application running in the workstation. As mentioned above, a
concept behind the Behavior Processor is the fact that "behavior" functions in Verilog Language
constructs such as $MONITOR and trigger conditions can be implemented in hardware and
therefore, the system accelerates these behavior functions in parallel. The Behavior Processor
can work with other Behavior Processors in the form of other FPGAs, such as FPGA 3108. In
fact, the Behavior Processor can be instantiated as many times as needed in hardware and is only
limited by hardware availability.

In one embodiment, the Behavior Processor is coded with Verilog RTL and is synthesized
and mapped into an Altera 10K250 FPGA. However, the Behavior Processor architecture is not
limited to the Altera technology. In fact, the power of the Behavior Processor is scalability: as
faster and better FPGA technology becomes available, the Behavior Processor can take advantage
of these fast moving technologies. In one embodiment of the present invention, the Behavior
Processor runs at the hardware speed of the FPGA. In one embodiment, that speed is a
20 MHz clock frequency.

BP INTERFACE WITH RCC

FIG. 101 shows the Behavior Processor 3110 and its interfaces in accordance with one
embodiment of the present invention. The Behavior Processor 3110 itself is an FPGA logic
device that can be programmed to provide any desired function(s) as known to those ordinarily
skilled in the art. The Behavior Processor 3110 includes a set of inputs 3111 and a set of outputs
3112, an END interface 3113, a START interface 3114, a WAIT interface 3115 and a FAST
CLK interface 3116. FIG. 102 below will illustrate how the Behavior Processor 3110 interfaces
with other elements of the RCC hardware system.

As mentioned above, one embodiment of the present invention integrates a Behavior
Processor in the RCC hardware accelerator to provide hardware functionality of traditionally
non-synthesizable HDL code elements. FIG. 102 shows the Behavior Processor integrated with the
RCC hardware system in accordance with one embodiment of the present invention. The system
controller 3120 represents the main system controller unit in the RCC hardware accelerator that
controls the traffic into and out of the RCC FPGA array. This system controller 3120 is also the
CTRL FPGA unit 701 in FIGS. 22 and 23. The system controller 3120 also generates control
signals as necessary to provide the traffic controller functionality. The RTL 3121 represents the
hardware model of the user's design that is modeled in the hardware accelerator's array of FPGA
devices.

As shown in FIG. 102, the Behavior Processor 3110 receives the END and START
signals on lines 3113 and 3114, respectively, from the system controller 3120. The Behavior
Processor 3110 also receives input data on line 3111 from the RTL 3121. Moreover, the
Behavior Processor 3110 receives the FAST CLK signal on line 3116 for use as the clock
reference. In one embodiment, this clock speed is 20 MHz. As for outputs, the Behavior
Processor 3110 provides a WAIT signal to the system controller via line 3115 and an output to
the RTL 3121 via line 3112. These signals will be discussed further below.

The system controller 3120 and the RTL 3121 also communicate with each other, of
course. The system controller 3120 delivers the EVAL signal to the RTL 3121 via line 3122.
The RTL 3121 delivers the EVAL REQ signal to the system controller via line 3123. These
signals will be discussed further below.

BP TIMING DIAGRAMS

FIG. 103 shows a timing diagram of the relevant interfaces of the Behavior Processor in
accordance with one embodiment of the present invention. Remember, the Behavior Processor
3110 (see FIG. 102) is designed and programmed to perform certain operations (e.g., conditions,
loops). These operations are typically behavioral in nature and were previously executed in
software only. Now, the Behavior Processor 3110 performs these operations in hardware. Note
that the Behavior Processor 3110 is not limited to behavioral operations only. For various
reasons, the Behavior Processor 3110 can execute anything from traditionally behavioral
operations to traditionally non-behavioral operations to a combination of behavioral and
non-behavioral operations. For example, an entire microprocessor can be programmed and created
in the Behavior Processor 3110 instead of in the arrays of FPGA logic devices in the RTL hardware
model 3121.

Typically, many instructions within a simulation time are executed within an EVAL
period. As explained previously, the EVAL_REQ_N (or EVAL_REQ#) signal is used to start
the evaluation cycle if any of the FPGA logic devices asserts this signal. The EVAL_REQ#
signal is used to start the evaluation cycle all over again if any of the FPGA chips asserts this
signal. For example, to evaluate data, data is transferred or written from main memory in the
host processor's computing station to the FPGAs via the PCI bus. At the end of the transfer, the
evaluation cycle begins including address pointer initialization and the operation of the software
clocks to facilitate the evaluation process. As the various EVAL REQ signals are asserted by
various FPGA logic devices in the RTL hardware model 3121, contention results. The resolution
of the contention results in the generation of the EVAL signal by the system controller 3120.
Thus, at time t1, the EVAL signal goes to logic "1" as the EVAL REQ goes to logic "1."
vll When the system controller 3120 asserts the EVAL signal to the RTL hardware model 

}z 3121, it also asserts the START signal to the Behavior Processor 3110 to signal the Behavior 
{2 Processor 3110 to start executing those instructions that it is programmed to execute. 

Concurrently, the Behavior Processor 3110 receives relevant data from the RTL hardware model 

Si 

0 15 3121 via line 311L Relevant output data is also delivered from the Behavior Processor 3110 to 
il the RTL hardware model 3121 via line 3112. The Behavior Processor 3110 processes data at the 
;H clock speed of the FAST CLK on line 31 16. In one embodiment, this speed is 20 MHz. 
After receiving the START signal at time t1, the Behavior Processor 3110 asserts the
WAIT signal at time t2 and processes relevant data that it receives. The WAIT signal is asserted
for as long as the Behavior Processor 3110 is processing data. In essence, the Behavior
Processor 3110 is telling the system controller 3120 to "wait" for the Behavior Processor 3110 to
process its data before the system controller 3120 decides to transfer more data into or out of the
RTL hardware model 3121. When the Behavior Processor 3110 has completed its execution, it
deasserts the WAIT signal. In FIG. 103, this deassertion occurs at time t3.

Even though the Behavior Processor 3110 has completed processing its own set of data,
the RTL hardware model 3121 may still be processing its own set of data which does not involve
the Behavior Processor 3110. Thus, the EVAL signal is still asserted by the system controller
and the RTL's EVAL_REQ signal is still asserted by the RTL hardware model 3121. As
described elsewhere in this patent specification, dynamic evaluation logic is implemented so that
the EVAL period is either extended or shortened depending on whether the data in the RTL
hardware model 3121 has stabilized. If it has not stabilized yet, the EVAL period is extended. If
the data has stabilized, the EVAL period ceases as soon as possible.

When the RTL hardware model 3121 has completed its evaluation of data, it deasserts the
EVAL_REQ signal at time t4. When both the WAIT signal and the EVAL_REQ signal have
been deasserted (e.g., logic "0"), the system controller 3120 asserts the END signal, as shown in
FIG. 103 at time t4. This END signal enables the latching of the last stable data in the RTL
hardware model 3121 by the Behavior Processor 3110. The deassertion of the END signal
coincides with the deassertion of the EVAL signal at time t5.

When only one (or neither) of the WAIT signal and the EVAL_REQ signal has been
deasserted, the END signal is not asserted by the system controller. If the EVAL_REQ
signal has been deasserted, this indicates that the RTL hardware model 3121 has completed its
evaluation of data. However, if the WAIT signal has not been deasserted by the Behavior
Processor 3110, the Behavior Processor 3110 is still busy processing its own set of data and the
system controller extends the EVAL period to allow the Behavior Processor 3110 to complete its
job.
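The END condition described above can be illustrated with a minimal Python sketch. It is only a behavioral model under stated assumptions, not the system controller's actual logic: the function names `end_asserted` and `first_end_time` and the sample waveform are hypothetical, and signals are modeled as 0/1 integers sampled at the labeled times of FIG. 103.

```python
def end_asserted(eval_req: int, wait: int) -> int:
    """Return 1 only when both EVAL_REQ and WAIT are deasserted (logic 0)."""
    return int(eval_req == 0 and wait == 0)


def first_end_time(samples):
    """Return the label of the first sample at which END would be asserted.

    samples is a list of (time_label, eval_req, wait) tuples; None is
    returned if END is never asserted in the given samples.
    """
    for label, eval_req, wait in samples:
        if end_asserted(eval_req, wait):
            return label
    return None


# Hypothetical samples loosely following FIG. 103: WAIT is asserted at t2 and
# deasserted at t3, while EVAL_REQ stays asserted until t4.
fig103 = [("t1", 1, 0), ("t2", 1, 1), ("t3", 1, 0), ("t4", 0, 0)]
print(first_end_time(fig103))
```

In this model the EVAL period is implicitly extended for as long as either input remains at logic "1", which mirrors the behavior described for a Behavior Processor that is still busy.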

FIG. 104 shows another timing diagram of the relevant interfaces of the Behavior
Processor in accordance with one embodiment of the present invention. FIG. 104 is similar to
FIG. 103 except that here, the Behavior Processor 3110 processes data at two different time
periods: first between times t2 and t3 and second between times t4 and t5. In any event, the
END signal is not asserted unless both the EVAL_REQ and WAIT signals are deasserted to logic
"0."

BP LANGUAGE

The following TABLE A represents the language used by the Behavior Processor which 
the user can use to control its operation: 

TABLE A: BEHAVIOR PROCESSOR LANGUAGE

Language         I/O   Description

behclk           O     20 MHz sysclk in RCC.
                       This clock is provided to the user to allow his behavior
                       processor to accomplish its processes within the simulation
                       cycle. Many 20 MHz clock cycles are provided within a
                       single simulation evaluation cycle.

behstart         O     Beginning of the evaluation cycle of the simulation;
                       beginning of the 20 MHz behavior processor timestep;
                       becomes 1 for 1 behavioral clock cycle in the RCC during
                       evaluation. See START in FIGS. 103 and 104.

behend           O     End of the evaluation cycle of the simulation; end of the
                       20 MHz behavior processor timestep (EOT); no more events;
                       old value for next evaluation; avoids race conditions.
                       See END in FIGS. 103 and 104.

behwait          I     Behavior processor is still busy. This signal indicates to
                       the system controller in the RCC to "wait" for the behavior
                       processor to complete its processing. See WAIT in
                       FIGS. 103 and 104.

$axis_set_behc   NA    Used for debugging the behavior processor.
                       "timescale" of 1ns/10ps.
                       $axis_set_behc(codebug, maxsteps)
                       (1, 1000)
                       (0, 1000)

axis_behctrl     NA    Any module with "axis_behctrl" primitive:
                       All logic in that module goes to a single FPGA for place
                       and route;
                       If enough resources are available, the compilation should
                       not get "no fit"
[illegible]      NA    [illegible]
                       ■ [illegible] in rccgen [illegible]
                       ■ 1 x 2K, 2 x 4K, 4 x 512, 8 x 256
                       ■ One port only [illegible]
                       ■ [illegible] -> behclk
                       ■ Program only (Addr, Dout, Din, We).
                       ■ Posedge behclk
                       ■ No gating
                       ■ No asynchronous reset
                       ■ always @(posedge clk)
                       ■ Synchronous reset only


Another useful aspect of the behavioral processor technology is that the user can mix user
RTL logic in the behavioral processor. The following language can be used, for example:

    module behP(en, data, clk)
    axis_behctrl (behclk, -, -, -);
    always @(posedge clk, rst...)

XTRIGGER PROCESSOR

FIG. 105 shows the Behavior Processor modeled as an Xtrigger processor in accordance
with one embodiment of the present invention. In one embodiment, the Xtrigger processor 3130
includes a set of inputs 3135, a set of trigger status outputs 3136, an arithmetic logic unit (ALU)
3131, memory 3132, a set of counters 3134, and a control unit 3133.

In one embodiment, the Xtrigger processor is programmed to monitor internal signals and
generate a trigger to the RCC system. Once the triggering condition is satisfied, the Xtrigger
processor can pause the execution of the RCC and make service requests from external sources.
Among other things, the Xtrigger processor allows the user to:

■ monitor and detect signal conditions in the RCC engine
■ change conditions on the fly during emulation runs
■ specify conditions in an easy, flexible, and powerful way, and
■ evaluate conditions quickly in the RCC



TRIGGER EXAMPLE 
A simple trigger example is the following:

    module model (a, b, c);
    input a, b;
    output c;

    wire trigger;
    wire [0:7] status;

    Xtrigger #2 proc(trigger, status, {a, b});
    endmodule



TRIGGER LANGUAGE 

The trigger processor can be programmed to monitor conditions on input signals using the
trigger language. The trigger language can be used to accomplish the following tasks:

■ Specify complex conditions
■ Model state machines
■ Control flow within each state
■ Communicate to and from the design space

The trigger language has the following program structure:

Declarations

■ signal name(width)

Statements

■ if-elseif-else - a conditional execution of action
■ goto - move from one state to another
■ programmable counters (32 bits)
    load counter = value - loads a counter value
    increment counter - increments the current counter by 1
    decrement counter - decrements the current counter by 1; stops at 0
    test for counter == 0 or != 0
■ programmable one-bit flag
    setflag - sets the general purpose flag to 1
    resetflag - resets the general purpose flag to 0
    test for flag == 0 or flag == 1
■ communicate to the design space
    oflag = value (8-bit wide)
    trigger = value (0 or 1)
■ expressions
    ==, !=, &&, ||, !
■ language is case-sensitive

The BNF notation for the trigger language is as follows: 

    Prog::        States
    States::      States State | State
    State::       STATE name : Action ';'
    Action::      LOAD Counter '=' Number ';'
                | DECREMENT Counter ';'
                | SETFLAG ';'
                | RESETFLAG ';'
                | GOTO Name ';'
                | Assign ';'
                | Ifstmt ';'
                | '{' Actions '}'
    Actions::     Actions Action | Action
    Assign::      OFLAG '=' Number
                | TRIGGER '=' Number
    Counter::     COUNTER0 | COUNTER1
    Ifstmt::      IF '(' Expr ')' Action
                | IF '(' Expr ')' Action Elsestmt
                | IF '(' Expr ')' Action Elseifstmts
                | IF '(' Expr ')' Action Elseifstmts Elsestmt
    Elsestmt::    ELSE '(' Expr ')' Action
    Elseifstmt::  ELSEIF '(' Expr ')' Action
    Expr::        Expr '&&' Expr
                | Expr '||' Expr
                | Expr '==' Expr
                | Expr '!=' Expr
                | number
                | INPUT_SIGNAL
    Number::      DECIMAL_NUM | binary | hex | octal
    Binary::      "b' [0-1]+
    Hex::         "h' [a-f0-9]+
    Octal::       "o' [0-7]+



EXAMPLES USING THE TRIGGER LANGUAGE

Here are two examples using the trigger language:

First example:

    Signal a(10);
    Signal b(10);
    Signal Clk(1);

    State s0: {
        Trigger = 0;
        If (a == 4) {
            Load counter0 = 30;
            Goto s2;
        }
    }

    State s2: {
        If (counter0 == 0)
        {
            if (b == 5)
                trigger = 1;
            goto s0;
        }
        if (clk == 1)
        {
            decrement counter0;
            goto s3;
        }
    }

    State s3: {
        if (clk == 0)
            goto s2;
    }

Second example:

    Signal a(10);
    Signal b(10);

    State s0 :
    {
        if (a == 1 && b == 5)
            goto s1;
    }

    State s1 :
    {
        if (a == 5 && b == 10)
            trigger = 1;
        else
            goto s0;
    }
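
To make the state-machine behavior of the second example easier to follow, the sketch below models it in Python. It is an illustrative model only, not the Xtrigger implementation. It assumes that the action of the current state is evaluated once per step on a fresh pair of input samples, and that trigger reads 0 in any step that does not assign it; the specification does not state either point. The function names `step` and `run` are hypothetical.

```python
def step(state, a, b):
    """Evaluate the current state's action once; return (next_state, trigger)."""
    trigger = 0
    if state == "s0":
        if a == 1 and b == 5:
            state = "s1"          # goto s1
    elif state == "s1":
        if a == 5 and b == 10:
            trigger = 1           # trigger = 1; no goto, so remain in s1
        else:
            state = "s0"          # goto s0
    return state, trigger


def run(samples, state="s0"):
    """Feed one (a, b) sample per step and collect the trigger value of each step."""
    triggers = []
    for a, b in samples:
        state, trig = step(state, a, b)
        triggers.append(trig)
    return state, triggers


# The trigger fires only when (a, b) == (1, 5) is followed directly by (5, 10).
print(run([(1, 5), (5, 10)]))
print(run([(1, 5), (0, 0), (5, 10)]))
```

The second call shows the sequence being broken by an intervening sample, which returns the machine to s0 without raising the trigger.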

CALL TESTBENCH PRIMITIVE (axis_tbcall)

As described above, the behavior processor provides a hardware-based "interrupt"-like
control. When some condition (as defined by the user based on his user design) is satisfied
within the behavior processor, it sends a control signal back to the RCC system and any testbench
processes. To provide I/O services and system controls during hardware emulation mode, one
embodiment of the present invention includes a call testbench primitive, axis_tbcall, that lets the
user use a hardware signal to call a software task during hardware emulation. The task is then
executed in software in the RCC workstation.

The syntax for the axis_tbcall primitive is as follows:

    axis_tbcall(trigger_signal, "task_to_execute");

The trigger_signal must be a scalar signal in the DUT (Device Under Test) that triggers
the task call. When trigger_signal goes posedge, the task_to_execute is called. The
task_to_execute is a local Verilog task defined in the current module scope. Note that the
following statements are not allowed:

■ wait
■ @event
■ #delays

Axis primitive axis_tbcall example

Here is an example of one of the many ways to use axis_tbcall. In the following program,
the purpose is to use axis_tbcall to $display from the DUT (Device Under Test) during
emulation mode:

    `timescale 1ns/100ps

    module dut;
    reg [63:0] dL;
    reg [6:0] dS;
    reg [1:0] cnt;
    reg enE;
    wire enP;
    wire clk;

    axis_clkgen #(1, 20) (clk);

    // Update the data registers on every clock; toggle enE each time cnt wraps to 0.
    always @(posedge clk) begin
        dL <= dL + 64'h0001000100010001;
        dS <= dS + 1;
        cnt <= cnt + 1;
        if (cnt == 0) begin
            enE <= ~enE;
        end
    end

    initial begin
        dL = 0;
        dS = 0;
        enE = 0;
        cnt = 0;
        $axis_set_clkgen("clk", 0, 5, 10, 10);
    end

    // Derive the trigger enP from enE and call task t1 on its posedge.
    axis_pulse(enP, enE);
    axis_tbcall(enP, "t1");

    task t1;
    reg [32:0] t;
    begin
        t = $time;
        $display(t, "L: %h", dL);
        $display(t, "S: %h", dS);
        if (t > 1024)
            $finish;
    end
    endtask

    endmodule

When the software in the RCC system receives the testbench call signal from the behavior
processor, it stops the hardware emulation while it processes the software task. After processing
the task, it sends a signal back to the behavior processor so that hardware emulation can resume.
The following sequence continues throughout the debug session so long as the behavior
processor is used:

1. satisfying some predefined condition in the hardware emulator,
2. triggering the delivery of a testbench call signal to a software process in the RCC,
3. halting the hardware emulation mode during the processing of the task associated with the
testbench call, and
4. sending a signal back to the hardware emulator to resume emulation.
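
The four-step sequence above can be sketched as a small Python model. This is a toy illustration under stated assumptions, not the RCC software: the class name `TbCall`, its `sample` method, and the idea of sampling the trigger once per step are all hypothetical. The model detects a rising edge on the trigger signal, runs the registered software task while emulation is notionally halted, and then returns so that emulation can resume.

```python
class TbCall:
    """Toy model of a testbench call: run a software task on each rising edge."""

    def __init__(self, task):
        self.task = task      # software task to execute (a Python callable here)
        self.prev = 0         # previous sampled value of the trigger signal
        self.calls = 0        # number of times the task has been called

    def sample(self, trigger, now):
        """Sample the trigger; on a 0 -> 1 edge, run the task and report True."""
        rising = self.prev == 0 and trigger == 1
        self.prev = trigger
        if rising:
            self.calls += 1
            # Emulation is considered halted for the duration of this call and
            # resumes when the call returns.
            self.task(now)
        return rising


log = []
tb = TbCall(log.append)
for now, trig in enumerate([0, 1, 1, 0, 1]):
    tb.sample(trig, now)
print(log)
```

Only the transitions from 0 to 1 invoke the task; holding the trigger at 1 does not call it again, which matches the posedge condition stated for trigger_signal.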

By defining and processing the conditional settings in the hardware emulator,
performance is improved. The software running on the RCC system need not expend valuable
time processing conditional statements. Once the condition is satisfied in hardware, the hardware
emulator sends an interrupt-like signal back to the software processes for performing the tasks
associated with the testbench call. Note that unlike the standard debugging system, one
embodiment of the present invention allows the user to define and model in hardware the
behavioral functionality that was previously modeled in software, and to set conditions to be
processed in the hardware emulator.

VII. SIMULATION SERVER

A Simulation server in accordance with another embodiment of the present invention is
provided to allow multiple users to access the same reconfigurable hardware unit to effectively
simulate and accelerate the same or different user designs in a time-shared manner. A high speed
simulation scheduler and state swapping mechanisms are employed to feed the Simulation server
with active simulation processes, which results in a high throughput. The server allows
multiple users or processes to access the reconfigurable hardware unit for acceleration and
hardware state swapping purposes. Once the acceleration has been accomplished or the hardware
state has been accessed, each user or process can then simulate in software only, thus releasing
control of the reconfigurable hardware unit to other users or processes.

In the Simulation server portion of this specification, terms such as "job" and "process"
are used. In this specification, the terms "job" and "process" are generally used
interchangeably. In the past, batch systems executed "jobs" and time-shared systems stored and
executed "processes" or programs. In today's systems, these jobs and processes are similar.
Thus, in this specification, the term "job" is not limited to batch-type systems and "process" is
not limited to time-shared systems; rather, at one extreme, a "job" is equivalent to a "process" if
the "process" can be executed within a time slice or without interruption by any other
time-shared intervener, and at the other extreme, a "job" is a subset of a "process" if the "process"
requires multiple time slices to complete. So, if a "process" requires multiple time slices to
execute to completion due to the presence of other equal priority users/processes, the "process"
is divided up into "jobs." Moreover, if the "process" does not require multiple time slices to
execute to completion because it is the sole high priority user or the process is short enough to
complete within a time slice, the "process" is equivalent to a "job." Thus, a user can interact
with one or more "processes" or programs that have been loaded and executed in the Simulation
system, and each "process" may require one or more "jobs" to complete in a time-shared
system.
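
As a worked illustration of the relationship between a "process" and its "jobs," the Python sketch below counts how many time slices a process would occupy. It is a simplification under stated assumptions: the function name `jobs_needed` is hypothetical, the process is assumed to compete with other equal priority users (so that every slice boundary ends a job), and the 5-second default is the default time slice given elsewhere in this specification.

```python
import math


def jobs_needed(process_seconds: float, slice_seconds: float = 5.0) -> int:
    """Number of jobs a process is divided into, one job per time slice used."""
    if slice_seconds <= 0:
        raise ValueError("slice_seconds must be positive")
    # A process that fits within one slice is equivalent to a single job.
    return max(1, math.ceil(process_seconds / slice_seconds))


print(jobs_needed(3))    # fits within one slice
print(jobs_needed(12))   # spans several slices
```

A process that is the sole high priority user would not be interrupted at all, so in that case it is equivalent to one job regardless of its length; the sketch does not model that case.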

In one system configuration, multiple users via remote terminals can utilize the same
multiprocessor workstation in a non-network environment to access the same reconfigurable
hardware unit to review/debug the same or different user circuit design. In a non-network
environment, remote terminals are connected to a main computing system for access to its
processing functions. This non-network configuration allows multiple users to share access to the
same user design for parallel debugging purposes. The access is accomplished via a time-shared
process in which a scheduler determines access priorities for the multiple users, swaps jobs, and
selectively locks hardware unit access among the scheduled users. In other instances, multiple
users may access the same reconfigurable hardware unit via the server, each for his/her own
separate and different user design, for debugging purposes. In this configuration, the multiple
users or processes are sharing the multiple microprocessors in the workstation with the operating
system.
In another configuration, multiple users or processes in separate microprocessor-based
workstations can access the same reconfigurable hardware unit to review/debug the same or
different user circuit design across a network. Similarly, the access is accomplished via a
time-shared process in which a scheduler determines access priorities for the multiple users, swaps
jobs, and selectively locks hardware unit access among the scheduled users. In a network
environment, the scheduler listens for network requests through UNIX socket system calls. The
operating system uses sockets to send commands to the scheduler.

As stated earlier, the Simulation scheduler uses a preemptive multiple priority round robin
algorithm. In other words, higher priority users or processes are served first until the user or
process completes the job and ends the session. Among equal priority users or processes, a
preemptive round robin algorithm is used in which each user or process is assigned an equal time
slice to execute its operations until completed. The time slice is short enough that multiple
users or processes will not have to wait a long time before being served. The time slice is also long
enough that sufficient operations are executed before the Simulation server's scheduler
interrupts one user or process to swap in and execute the new user's job. In one embodiment, the
default time slice is 5 seconds and is user settable. In one embodiment, the scheduler makes
specific calls to the operating system's built-in scheduler.

FIG. 45 shows a non-network environment with a multiprocessor workstation in
accordance with one embodiment of the present invention. FIG. 45 is a variation of FIG. 1, and
accordingly, like reference numerals will be used for like components/units. Workstation 1100
includes local bus 1105, a host/PCI bridge 1106, memory bus 1107, and main memory 1108. A
cache memory subsystem (not shown) may also be provided. Other user interface units (e.g.,
monitor, keyboard) are also provided but not shown in FIG. 45. Workstation 1100 also includes
multiple microprocessors 1101, 1102, 1103, and 1104 coupled to the local bus 1105 via a
scheduler 1117 and connections/path 1118. As known to those skilled in the art, an operating
system 1121 provides the user-hardware interface foundation for the entire computing
environment for managing files and allocating resources for the various users, processes, and
devices in the computing environment. For conceptual purposes the operating system 1121 along
with a bus 1122 are shown. References to operating systems can be made in Abraham
Silberschatz and James L. Peterson, OPERATING SYSTEM CONCEPTS (1988) and William
Stallings, MODERN OPERATING SYSTEMS (1996), which are incorporated herein by
reference.

In one embodiment, the workstation 1100 is a Sun Microsystems Enterprise 450 system
which employs UltraSPARC II processors. Instead of the memory access via the local bus, the
Sun 450 system allows the multiprocessors to access the memory via dedicated buses to the
memory through a crossbar switch. Thus, multiple processes can be running with multiple
microprocessors executing their respective instructions and accessing the memory without going
through the local bus. The Sun 450 system along with the Sun UltraSPARC multiprocessor
specifications are incorporated herein by reference. The Sun Ultra 60 system is another example
of a multiprocessor system, although it allows only two processors.

The scheduler 1117 provides the time-shared access to the reconfigurable hardware unit
20 via the device driver 1119 and connections/path 1120. Scheduler 1117 is implemented mostly
in software to interact with the operating system of the host computing system and partially in
hardware to interact with the Simulation server by supporting the simulation job interruption and
swapping in/out the simulation sessions. The scheduler 1117 and device driver 1119 will be
discussed in more detail below.

Each microprocessor 1101-1104 is capable of processing independently of the other
microprocessors in the workstation 1100. In one embodiment of the present invention, the
workstation 1100 is operating under a UNIX-based operating system, although in other
embodiments, the workstation 1100 can operate under a Windows-based or Macintosh-based
operating system. For UNIX-based systems, the user is equipped with X-Windows for the user
interface to manage programs, tasks, and files as necessary. For details on the UNIX operating
system, reference is made to Maurice J. Bach, THE DESIGN OF THE UNIX OPERATING
SYSTEM (1986).

In FIG. 45, multiple users can access workstation 1100 via remote terminals. At times,
each user may be using a particular CPU to run its processes. At other times, each user uses
different CPUs depending on the resource limitations. Usually, the operating system 1121
determines such accesses and indeed, the operating system itself may jump from one CPU to
another to accomplish its tasks. To handle the time-sharing process, the scheduler listens for
network requests through socket system calls and makes system calls to the operating system 1121,
which in turn handles preemption by initiating the generation of interrupt signals by the device
driver 1119 to the reconfigurable hardware unit 20. Such interrupt signal generation is one of
many steps in the scheduling algorithm, which includes stopping the current job, saving state
information for the currently interrupted job, swapping jobs, and executing the new job. The
server scheduling algorithm will be discussed below.

Sockets and socket system calls will now be discussed briefly. The UNIX operating
system, in one embodiment, can operate in a time-sharing mode. The UNIX kernel allocates the
CPU to a process for a period of time (e.g., time slice) and at the end of the time slice, preempts
the process and schedules another one for the next time slice. The preempted process from the
previous time slice is rescheduled for execution at a later time slice.

One scheme for enabling and facilitating interprocess communication and allowing use of
sophisticated network protocols is sockets. The kernel has three layers that function in the
context of a client-server model. These three layers include the socket layer, the protocol layer,
and the device layer. The top layer, the socket layer, provides the interface between the system
calls and the lower layers (protocol layer and device layer). Typically, the socket has end points
that couple client processes with server processes. The socket end points can be on different
machines. The middle layer, the protocol layer, provides the protocol modules for
communication, such as TCP and IP. The bottom layer, the device layer, contains the device
drivers that control the network devices. One example of a device driver is an Ethernet driver
over an Ethernet-based network.

Processes communicate using the client-server model, where the server process listens to a
socket at one end point and a client process communicates with the server process over another
socket at the other end point of the two-way communication path. The kernel maintains internal
connections among the three layers of each client and server and routes data from client to the
server as needed.

The socket mechanism includes several system calls, including a socket system call which
establishes the end points of a communication path. Many processes use the socket descriptor sd
in many system calls. The bind system call associates a name with a socket descriptor. Some other
exemplary system calls include the connect system call, which requests that the kernel make a
connection to a socket; the close system call, which closes sockets; the shutdown system call,
which closes a socket connection; and the send and recv system calls, which transmit data over a
connected socket.
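
The system calls named above map directly onto Python's standard `socket` module, which wraps the same kernel interface. The following sketch is offered only as an illustration of the call sequence, not as part of the Simulation server: the helper names `serve_once` and `round_trip`, the loopback address, and the upper-casing "service" are all assumptions made for the example.

```python
import socket
import threading


def serve_once(server_sock):
    """Accept one client, reply with the upper-cased message, close the connection."""
    conn, _addr = server_sock.accept()
    with conn:
        data = conn.recv(1024)         # recv: read the client's request
        conn.sendall(data.upper())     # send: return the reply


def round_trip(message: bytes) -> bytes:
    # socket: create the server end point; bind: associate an address with it.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))      # port 0 lets the kernel choose a free port
    server.listen(1)
    port = server.getsockname()[1]
    worker = threading.Thread(target=serve_once, args=(server,))
    worker.start()

    # connect: ask the kernel to connect the client end point to the server.
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(("127.0.0.1", port))
    client.sendall(message)
    client.shutdown(socket.SHUT_WR)    # shutdown: close the sending direction
    reply = client.recv(1024)
    client.close()                     # close: release the socket descriptor

    worker.join()
    server.close()
    return reply


print(round_trip(b"start simulation"))
```

The server thread plays the role of the listening process and the main thread plays the client; on separate machines the same calls would be used with a routable address in place of the loopback address.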

FIG. 46 shows another embodiment in accordance with the present invention in which
multiple workstations share a single Simulation system on a time-shared basis across a network.
The multiple workstations are coupled to the Simulation system via a scheduler 1117. Within
the computing environment of the Simulation system, a single CPU 11 is coupled to the local bus
12 in station 1110. Multiple CPUs may also be provided in this system. As known to those
skilled in the art, an operating system 1121 is also provided and nearly all processes and
applications reside on top of the operating system. For conceptual purposes the operating system
1121 along with a bus 1122 are shown.

In FIG. 46, workstation 1110 includes those components/units found in FIG. 1 along with
scheduler 1117 and scheduler bus 1118 coupled to the local bus 12 via the operating system 1121.
Scheduler 1117 controls the time-shared access for the user stations 1111, 1112, and 1113 by
making socket calls to the operating system 1121. Scheduler 1117 is implemented mostly in
software and partially in hardware.

In this figure, only three users are shown and capable of accessing the Simulation system
across the network. Of course, other system configurations provide for more than three users or
fewer than three users. Each user accesses the system via remote stations 1111, 1112, or 1113.
Remote user stations 1111, 1112, and 1113 are coupled to the scheduler 1117 via network
connections 1114, 1115, and 1116, respectively.

As known to those skilled in the art, device driver 1119 is coupled between the PCI bus
50 and the reconfigurable hardware unit 20. A connection or electrically conductive path 1120 is
provided between the device driver 1119 and the reconfigurable hardware unit 20. In this
network multi-user embodiment of the present invention, the scheduler 1117 interfaces with the
device driver 1119 via the operating system 1121 to communicate with and control the
reconfigurable hardware unit 20 for hardware acceleration and simulation after hardware state
restoration purposes.

Again, in one embodiment, the Simulation workstation 1100 is a Sun Microsystems
Enterprise 450 system which employs UltraSPARC II multiprocessors. Instead of tying up the
local bus for memory access, the Sun 450 system allows the multiprocessors to access the memory
via dedicated buses to the memory through a crossbar switch.

FIG. 47 shows a high level structure of the Simulation server in accordance with the
network embodiment of the present invention. Here, the operating system is not explicitly shown
but, as known to those skilled in the art, it is always present for file management and resource
allocation purposes to serve the various users, processes, and devices in the Simulation
computing environment. Simulation server 1130 includes the scheduler 1137, one or more device
drivers 1138, and the reconfigurable hardware unit 1139. Although not expressly shown as a
single integral unit in FIGS. 45 and 46, the Simulation server comprises the scheduler 1117,
device driver 1119, and the reconfigurable hardware unit 20. Returning to FIG. 47, the
Simulation server 1130 is coupled to three workstations (or users) 1131, 1132, and 1133 via
network connections/paths 1134, 1135, and 1136, respectively. As stated above, more than three
or fewer than three workstations may be coupled to the Simulation server 1130.

The scheduler in the Simulation server is based on a preemptive round robin algorithm.
In essence, the round robin scheme allows several users or processes to execute sequentially to
completion with a cyclic executive. Thus, each simulation job (which is associated with a
workstation in a network environment or a user/process in a multiprocessing non-network
environment) is assigned a priority level and a fixed time slice in which to execute.

Generally, the higher priority jobs execute first to completion. At one extreme, if
different users each have different priorities, the user with the highest priority is served first until
this user's job(s) is/are completed and the user with the lowest priority is served last. Here, no
time slice is used because each user has a different priority and the scheduler merely serves users
according to priority. This scenario is analogous to having only one user accessing the
Simulation system until completion.

At the other extreme, the different users have equal priority. Thus, the time slice concept
with a first-in first-out (FIFO) queue is employed. Among equal priority jobs, each job
executes until it completes or the fixed time slice expires, whichever comes first. If the job does
not execute to completion during its time slice, the simulation image associated with whatever
tasks it has completed must be saved for later restoration and execution. This job is then placed
at the end of the queue. The saved simulation image, if any, for the next job is then restored and
executed in the next time slice.

A higher priority job can preempt a lower priority job. In other words, jobs of equal
priority run in round robin fashion until they execute through the time slices to completion.
Thereafter, jobs of lower priority run in round robin fashion. If a job of higher priority is
inserted in the queue while a lower priority job is running, the higher priority job will preempt
the lower priority job until the higher priority job executes to completion. Thus, jobs of higher
priority run to completion before jobs of lower priority begin execution. If the lower priority job
has already begun execution, the lower priority job will not be further executed to completion
until the higher priority job executes to completion.

In one embodiment, the UNIX operating system provides the basic and foundational
preemptive round robin scheduling algorithm. The Simulation server's scheduling algorithm in
accordance with one embodiment of the present invention works in conjunction with the operating
system's scheduling algorithm. In UNIX-based systems, the preemptive nature of the scheduling
algorithm allows the operating system to preempt user-defined schedules. To enable the
time-sharing scheme, the Simulation scheduler uses a preemptive multiple priority round robin
algorithm on top of the operating system's own scheduling algorithm.
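The preemptive multiple priority round robin policy described above can be illustrated with a small sketch. It is not the patented implementation: the `Job` record, the convention that a smaller number means higher priority, and the measurement of work in whole time slices are all assumptions made for illustration. Saving and restoring simulation images is only indicated in comments.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int        # assumed convention: smaller number = higher priority
    slices_needed: int   # remaining work, measured in whole time slices
    arrival: int = 0     # slice index at which the job enters the queue


def run_schedule(jobs):
    """Return the name of the job served in each time slice."""
    pending = sorted(jobs, key=lambda j: j.arrival)
    queues = {}          # priority level -> FIFO queue of ready jobs
    timeline = []
    tick = 0
    while pending or any(queues.values()):
        # Admit newly arrived jobs at the end of their priority's FIFO queue.
        while pending and pending[0].arrival <= tick:
            job = pending.pop(0)
            queues.setdefault(job.priority, deque()).append(job)
        ready = [p for p, q in queues.items() if q]
        if not ready:
            tick = pending[0].arrival   # nothing to run until the next arrival
            continue
        # The highest non-empty priority level always wins, so a newly arrived
        # higher priority job preempts at the next slice boundary.
        queue = queues[min(ready)]
        job = queue.popleft()           # restore saved image (not modeled)
        timeline.append(job.name)
        job.slices_needed -= 1
        tick += 1
        if job.slices_needed > 0:
            queue.append(job)           # save image, go to the back of the queue
    return timeline
```

With two equal priority jobs that each need two slices, the sketch alternates between them; a higher priority job that arrives later takes over at the next slice boundary and runs to completion before the lower priority job resumes. Note that `run_schedule` decrements `slices_needed` on the `Job` objects it is given.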

The relationship between the multiple users and the Simulation server in accordance with
one embodiment of the present invention follows a client-server model, where the multiple users
are clients and the Simulation server is the server. Communication between the user clients and
the server occurs via socket calls. Referring briefly to FIG. 55, the client includes client
program 1109, a socket system call component 1123, UNIX kernel 1124, and a TCP/IP protocol
component 1125. The server includes a TCP/IP protocol component 1126, a UNIX kernel 1127,
socket system call component 1128, and the Simulation server 1129. Multiple clients may
request simulation jobs to be simulated in the server through UNIX socket calls from the client
application program.

In one embodiment, a typical sequence of events includes multiple clients sending requests
to the server via the UNIX socket protocol. For each request, the server acknowledges the
request as to whether the command was successfully executed. For the request of server queue
status, however, the server replies with the current queue state so that it can be properly
displayed to the user. Table F below lists the relevant socket commands from the client:



Table F: Client Socket Commands

Commands   Description
0          Start simulation < design >
1          Pause simulation < design >
2          Exit simulation < design >
3          Re-assign priority to simulation session
4          Save design simulation state
5          Queue status



For each socket call, each command encoded in integers may be followed with additional
parameters such as < design > which represents the design name. Response from the Simulation
server will be "0" if the command is executed successfully or a "1" if the command failed. For
command "5" which requests queue status, one embodiment of the command's return response is
ASCII text terminated by a "\0" character for display onto the user's screen. With these system
socket calls, the appropriate communication protocol signals are transmitted to and received from
the reconfigurable hardware unit via device drivers.
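As a hedged illustration of the client commands and responses just described, the sketch below encodes a request and interprets the two kinds of reply. The integer codes, the "0"/"1" acknowledgement, and the "\0"-terminated queue status text come from the description; the space-separated, newline-terminated ASCII request layout and all function names are assumptions, since the specification does not give the byte-level format.

```python
from enum import IntEnum


class ClientCommand(IntEnum):
    """Integer command codes listed in Table F."""
    START = 0
    PAUSE = 1
    EXIT = 2
    REASSIGN_PRIORITY = 3
    SAVE_STATE = 4
    QUEUE_STATUS = 5


def encode_request(command, *params):
    """Build one request: the integer code followed by optional parameters.

    The space-separated, newline-terminated ASCII layout is an assumption.
    """
    fields = [str(int(command)), *params]
    return (" ".join(fields) + "\n").encode("ascii")


def command_succeeded(reply):
    """Interpret the acknowledgement: "0" means success, "1" means failure."""
    text = reply.decode("ascii").strip()
    if text not in ("0", "1"):
        raise ValueError(f"unexpected acknowledgement: {text!r}")
    return text == "0"


def decode_queue_status(reply):
    """Return the ASCII queue listing that precedes the terminating NUL."""
    text, terminator, _ = reply.partition(b"\0")
    if not terminator:
        raise ValueError("queue status reply is not NUL-terminated")
    return text.decode("ascii")
```

In a real client these byte strings would be written to and read from a connected socket; the helpers are kept free of I/O so that the encoding rules can be checked in isolation.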

FIG. 48 shows one embodiment of the architecture of the Simulation server in accordance
with the present invention. As explained above, multiple users or multiple processes may be
served by the single Simulation server for simulation and hardware acceleration of the users'
designs in a time-shared manner. Thus, user/process 1147, 1148, and 1149 are coupled to the
Simulation server 1140 via inter-process communication paths 1150, 1151, and 1152,
respectively. The inter-process communication paths 1150, 1151, and 1152 may reside in the
same workstation for multiprocessor configuration and operation, or in the network for multiple
workstations. Each simulation session contains software simulation states along with hardware
states for communication with the reconfigurable hardware unit. Inter-process communication
among the software sessions is performed using UNIX socket or system calls which provide the
capability to have the simulation session reside on the same workstation where the Simulator
plug-in card is installed or on a separate workstation connected via a TCP/IP network.
Communication with the Simulation server will be initiated automatically.

In FIG. 48, Simulation server 1140 includes the server monitor 1141, a simulation job
queue table 1142, a priority sorter 1143, a job swapper 1144, device driver(s) 1145, and the
reconfigurable hardware unit 1146. The simulation job queue table 1142, priority sorter 1143,
and job swapper 1144 make up the scheduler 1137 shown in FIG. 47.

The server monitor 1141 provides user interface functions for the administrator of the
system. The user can monitor the status of the Simulation server state by commanding the system
to display simulation jobs in the queue, scheduling priority, usage history, and simulation job
swapping efficiency. Other utility functions include editing job priority, deleting simulation jobs,
and resetting the simulation server state.

The simulation job queue table 1142 keeps a list of all outstanding simulation requests in
the queue which were inserted by the scheduler. The table entries include job number, software
simulation process number, software simulation image, hardware simulation image file, design
configuration file, priority number, hardware size, software size, cumulative time of the
simulation run, and owner identification. The job queue is implemented using a first-in first-out
(FIFO) queue. Thus, when a new job is requested, it is placed at the end of the queue.

The priority sorter 1143 decides which simulation job in the queue to execute. In one
embodiment, the simulation job priority scheme is user definable (i.e., controllable and definable
by the system administrator) to control which simulation process has priority for current
execution. In one embodiment, the priority levels are fixed based on the urgency of specific
processes or importance of specific users. In another embodiment, the priority levels are
dynamic and can change during the course of the simulation. In the preferred embodiment,
priority is based on the user ID. Typically, one user will have a high priority and all other users
will have lower but equal priority.

Priority levels are settable by the system administrator. The Simulator server obtains all user
information from the UNIX facility, typically found in the UNIX user file called "/etc/passwd".
Adding new users is consistent with the process of adding new users within the UNIX system.
After all users are defined, the Simulator server monitor can be used to adjust priority levels for
the users.
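The job queue table and the user-ID-based priority sorter can be pictured with the following sketch. The field list follows the table entries named above, but the field names, types, default values, the `enqueue` and `next_job` helpers, and the convention that a smaller number means higher priority are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobEntry:
    """One row of the simulation job queue table (names are illustrative)."""
    job_number: int
    owner: str                  # owner identification (user ID)
    priority: int               # assumed: smaller value = higher priority
    design_config_file: str
    sw_process_number: int = 0
    sw_image: str = ""          # software simulation image
    hw_image_file: str = ""     # hardware simulation image file
    hw_size: int = 0
    sw_size: int = 0
    cumulative_time: float = 0.0


def enqueue(queue_table, job_number, owner, design_config_file,
            user_priority, default_priority=10):
    """Append a new request at the end of the FIFO queue table.

    Priority is looked up from the owner's user ID; users without an
    administrator-assigned level share the same default level.
    """
    priority = user_priority.get(owner, default_priority)
    entry = JobEntry(job_number, owner, priority, design_config_file)
    queue_table.append(entry)
    return entry


def next_job(queue_table):
    """Pick the job to run: best priority first, FIFO order among equals."""
    if not queue_table:
        return None
    # min() returns the first minimal element, so older entries win ties and
    # first-in first-out order is preserved within a priority level.
    return min(queue_table, key=lambda entry: entry.priority)
```

This mirrors the typical case described in the text, in which one user holds a high priority and all other users share a lower, equal level.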

The job swapper 1144 temporarily replaces one simulation job associated with one process
or one workstation with another simulation job associated with another process or workstation
based on the priority determination programmed for the scheduler. If multiple users are
simulating the same design, the job swapper swaps in only the stored simulation state for the
simulation session. However, if multiple users are simulating multiple designs, the job swapper
loads in the design for hardware configuration before swapping in the simulation state. In one
embodiment, the job swapping mechanism enhances the performance of the time-sharing
embodiment of the present invention because the job swapping need only be done for
reconfigurable hardware unit access. So, if one user needs software simulation for some time
period, the server swaps in another job for another user so that this other user can access the
reconfigurable hardware unit for hardware acceleration. The frequency of the job swapping can
be user adjustable and programmable. The device driver also communicates with the
reconfigurable hardware unit to swap jobs.

The operation of the Simulation server will now be discussed. FIG. 49 shows a flow
diagram of the Simulation server during its operation. Initially, at step 1160, the system is idle.
When the system is idle in step 1160, this does not necessarily mean that the Simulation server
is inactive or that no simulation task is running. Indeed, idleness may mean one of several
things: (1) no simulation is running; (2) only one user/workstation is active in a single
processor environment so that time-sharing is not required; or (3) only one user/workstation in
a multiprocessing environment is active but only one process is running. Thus, conditions 2 and
3 above indicate that the Simulation server has only one job to process so that queuing jobs,
determining priorities, and swapping jobs are not necessary and essentially, the Simulation
server is idle because it receives no requests (event 1161) from other workstations or processes.

When a simulation request occurs due to one or more request signals from a workstation
in a multi-user environment or from a microprocessor in a multiprocessor environment, the
Simulation server queues the incoming simulation job or jobs at step 1162. The scheduler keeps
a simulation job queue table to insert all outstanding simulation requests onto its queue and list all
outstanding simulation requests. For batch simulation jobs, the scheduler in the server queues all
the incoming simulation requests and automatically processes the tasks without human
intervention.

The Simulation server then sorts the queued jobs to determine priority at step 1163. This
step is particularly important for multiple jobs where the server has to prioritize among them to
provide access to the reconfigurable hardware unit. The priority sorter decides which simulation
job in the queue to execute. In one embodiment, the simulation job priority scheme is user
definable (i.e., controllable and definable by the system administrator) to control which
simulation process has priority for current execution if a resource contention exists.

After priority sorting at step 1163, the server then swaps simulation jobs, if necessary, at
step 1164. This step temporarily replaces one simulation job associated with one process or one
workstation with another simulation job associated with another process or workstation based on
the priority determination programmed for the scheduler in the server. If multiple users are
simulating the same design, the job swapper swaps in only the stored simulation state for the
simulation session. However, if multiple users are simulating multiple designs, the job swapper
loads in the design first before swapping in the simulation state. Here, the device driver also
communicates with the reconfigurable hardware unit to swap jobs.

In one embodiment, the job swapping mechanism enhances the performance of the time-sharing
embodiment of the present invention because the job swapping need only be done for
reconfigurable hardware unit access. So, if one user needs software simulation for some time
period, the server swaps in another job for another user so that this other user can access the
reconfigurable hardware unit for hardware acceleration. For example, assume that two users,
user 1 and user 2, are coupled to the Simulation server for access to the reconfigurable hardware
unit. At one time, user 1 has access to the system so that debugging can be performed for his/her
user design. If user 1 is debugging in software mode only, the server can release the
reconfigurable hardware unit so that user 2 can access it. The server swaps in the job for user 2
and user 2 can then either software simulate or hardware accelerate the model. Depending on the
priorities between user 1 and user 2, user 2 can continue accessing the reconfigurable hardware
unit for some predetermined time or, if user 1 needs the reconfigurable hardware unit for
acceleration, the server can preempt the job for user 2 so that the job for user 1 can be swapped
in for hardware acceleration using the reconfigurable hardware unit. The predetermined time
refers to the pre-emption of simulator jobs based on multiple requests of the same priority. In
one embodiment, the default time is 5 minutes although this time is user settable. This 5 minute
setting represents one form of a time-out timer. The Simulation system of the present invention
uses the time-out timer to stop the execution of the current simulation job because it is
excessively time consuming and the system decides that other pending jobs of equal priority
should gain access to the reconfigurable hardware model.

Upon the completion of the job swapping step in step 1164, the device driver in the server
locks the reconfigurable hardware unit so that only the currently scheduled user or process can
simulate and use the hardware model. The locking and simulation step occurs at step 1165.

At the occurrence of either the completion of simulation or a pause in the currently
simulating session at event 1166, the server returns to the priority sorter step 1163 to determine
priority of pending simulation jobs and later swap simulation jobs if necessary. Similarly, the
server may preempt the running of the currently active simulation job at event 1167 to return the
server to the priority sorter state 1163. The preemption occurs only under certain conditions.
One such condition is when a higher priority task or job is pending. Another such condition is
when the system is currently running a computationally intensive simulation task, in which case
the scheduler can be programmed to preempt the currently running job to schedule a task or job
with equal priority by utilizing a time-out timer. In one embodiment, the time-out timer is set at
5 minutes and if the current job executes for 5 minutes, the system preempts the current job and
swaps in the pending job even though it is at the same priority level.
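The two preemption conditions just described (a pending higher priority job, or an equal priority job once the time-out timer has expired) can be condensed into one decision function. This is a sketch under stated assumptions: priorities are integers where a smaller value means higher priority, time is measured in seconds, and the function name and signature are illustrative only.

```python
DEFAULT_TIMEOUT_S = 5 * 60  # five-minute default from the text; user settable


def should_preempt(current_priority, elapsed_s, pending_priorities,
                   timeout_s=DEFAULT_TIMEOUT_S):
    """Decide whether the running job should be preempted (event 1167).

    Priorities are integers where a smaller value means higher priority;
    that encoding is an assumption of this sketch.
    """
    if not pending_priorities:
        return False                      # nothing is waiting for the hardware
    best_pending = min(pending_priorities)
    if best_pending < current_priority:
        return True                       # a higher priority job is pending
    # Equal priority jobs take over only once the time-out timer expires.
    return best_pending == current_priority and elapsed_s >= timeout_s
```

A lower priority pending job never triggers preemption in this sketch, no matter how long the current job has been running.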

FIG. 50 shows a flow diagram of the job swapping process. The job swapping function is
performed in step 1164 of FIG. 49 and is shown in the Simulation server hardware as job
swapper 1144 in FIG. 48. In FIG. 50, when a simulation job needs to be swapped with another
simulation job, the job swapper sends an interrupt to the reconfigurable hardware unit at step
1180. If the reconfigurable hardware unit is not currently running any jobs (i.e., the system is
idle or the user is operating in software simulation mode only without any hardware acceleration
intervention), the interrupt immediately prepares the reconfigurable hardware unit for job
swapping. However, if the reconfigurable hardware unit is currently running a job and in the
midst of executing an instruction or processing data, the interrupt signal is recognized but the
reconfigurable unit continues to execute the currently pending instruction and process the data for
the current job. If the reconfigurable hardware unit receives the interrupt signal while the current
simulation job is not in the middle of executing an instruction or processing data, then the
interrupt signal essentially terminates the operation of the reconfigurable hardware unit
immediately.




At step 1181, the Simulation system saves the current simulation image (i.e., hardware 
and software states). By saving this image, users can later restore the simulation run without re- 
running the whole simulation up to that saved point. 

At step 1182, the Simulation system configures the reconfigurable hardware unit with the
new user design. This configuration step is only necessary if the new job is associated with a
different user design than the one already configured and loaded in the reconfigurable hardware
unit and whose execution has just been interrupted. After configuration, the saved hardware
simulation image is reloaded at step 1183 and the saved software simulation image is reloaded at
step 1184. If the new simulation job is associated with the same design, then no additional
configuration is needed. For the same design, the Simulation system loads the desired hardware
simulation image associated with the new simulation job for that same design at step 1183
because the simulation image for the new job is probably different from the simulation image for
the just interrupted job. The details of the configuration step are provided herein in this patent
specification. Thereafter, the associated software simulation image is reloaded at step 1184.
After reloading of the hardware and software simulation images, the simulation can begin at step
1185 for this new job, while the previous interrupted job can proceed in software simulation
mode only because it has no access to the reconfigurable hardware unit for the moment.
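The job swapping steps 1180 through 1185 can be traced with a toy model. Everything here is hypothetical scaffolding: `HardwareUnit` and `SimJob` are stand-ins that hold state in dictionaries, and the log strings exist only to show which steps run. The one behavior taken from the description is that configuration (step 1182) is skipped when the incoming job uses the design already loaded.

```python
from dataclasses import dataclass, field


@dataclass
class HardwareUnit:
    """Toy stand-in for the reconfigurable hardware unit (hypothetical)."""
    design: str = ""
    hw_state: dict = field(default_factory=dict)
    log: list = field(default_factory=list)


@dataclass
class SimJob:
    name: str
    design: str
    hw_image: dict = field(default_factory=dict)   # saved hardware state
    sw_image: dict = field(default_factory=dict)   # saved software state
    sw_state: dict = field(default_factory=dict)   # live software state


def swap_jobs(unit, outgoing, incoming):
    """Swap `outgoing` (may be None) out of the unit and `incoming` in."""
    unit.log.append("1180 interrupt")
    if outgoing is not None:
        # Step 1181: save hardware and software state of the interrupted job.
        outgoing.hw_image = dict(unit.hw_state)
        outgoing.sw_image = dict(outgoing.sw_state)
        unit.log.append("1181 save image")
    if unit.design != incoming.design:
        # Step 1182: reconfigure only when the design differs.
        unit.design = incoming.design
        unit.log.append("1182 configure")
    unit.hw_state = dict(incoming.hw_image)        # step 1183
    unit.log.append("1183 reload hardware image")
    incoming.sw_state = dict(incoming.sw_image)    # step 1184
    unit.log.append("1184 reload software image")
    unit.log.append("1185 simulate")
```

Swapping between two jobs that share a design leaves no "1182 configure" entry in the log, while swapping to a job with a different design does.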

FIG. 51 shows the signals between the device driver and the reconfigurable hardware unit.
The device driver 1171 provides the interface between the scheduler 1170 and the reconfigurable
hardware unit 1172. The device driver 1171 also provides the interface between the entire
computing environment (i.e., workstation(s), PCI bus, PCI devices) and the reconfigurable
hardware unit 1172 as shown in FIGS. 45 and 46, but FIG. 51 shows the Simulation server
portion only. The signals between the device driver and the reconfigurable hardware unit
include the bi-directional communication handshake signals, the unidirectional design
configuration information from the computing environment via the scheduler to the reconfigurable
hardware unit, the swapped in simulation state information, the swapped out simulation state
information, and the interrupt signal from the device driver to the reconfigurable hardware unit
so that the simulation jobs can be swapped.

Line 1173 carries the bi-directional communication handshake signals. These signals and
the handshake protocol will be discussed further with respect to FIGS. 53 and 54.

Line 1174 carries the unidirectional design configuration information from the computing
environment via the scheduler 1170 to the reconfigurable hardware unit 1172. Initial
configuration information can be transmitted to the reconfigurable hardware unit 1172 for
modeling purposes on this line 1174. Additionally, when users are modeling and simulating
different user designs, the configuration information must be sent to the reconfigurable hardware
unit 1172 during a time slice. When different users are modeling the same user design, no new
design configuration is necessary; rather, different simulation hardware states associated with the
same design may need to be transmitted to the reconfigurable hardware unit 1172 for different
simulation runs.

Line 1175 carries the swapped in simulation state information to the reconfigurable
hardware unit 1172. Line 1176 carries the swapped out simulation state information from the
reconfigurable hardware unit to the computing environment (i.e., usually memory). The swapped
in simulation state information includes previously saved hardware model state information and
the hardware memory state that the reconfigurable hardware unit 1172 needs to accelerate. The
swapped in state information is sent at the beginning of a time slice so that the scheduled current
user can access the reconfigurable hardware unit 1172 for acceleration. The swapped out state
information includes hardware model and memory state information that must be saved in
memory at the end of a time slice upon the reconfigurable hardware unit 1172 receiving an
interrupt signal to move on to the next time slice associated with a different user/process. The
saving of the state information allows the current user/process to restore this state at a later time,
such as at the next time slice that is assigned to this current user/process.

Line 1177 sends the interrupt signal from the device driver 1171 to the reconfigurable
hardware unit so that the simulation jobs can be swapped. This interrupt signal is sent between
time slices to swap out the current simulation job in the current time slice and swap in the new
simulation job for the next time slice.

The communication handshake protocol in accordance with one embodiment of the present
invention will now be discussed with reference to FIGS. 53 and 54. FIG. 53 shows the
communication handshake signals between the device driver and the reconfigurable hardware unit
via a handshake logic interface. FIG. 54 shows a state diagram of the communication protocol.
FIG. 51 shows the communication handshake signals on line 1173. FIG. 53 shows a detailed
view of the communication handshake signals between the device driver 1171 and the
reconfigurable hardware unit 1172.

In FIG. 53, a handshake logic interface 1234 is provided in the reconfigurable hardware
unit 1172. Alternatively, the handshake logic interface 1234 can be installed external to the
reconfigurable hardware unit 1172. Four sets of signals are provided between the device driver
1171 and the handshake logic interface 1234. These signals are the 3-bit SPACE signal on line
1230, a single-bit read/write signal on line 1231, a 4-bit COMMAND signal on line 1232, and a
single-bit DONE signal on line 1233. The handshake logic interface includes logic circuitry that
processes these signals to place the reconfigurable hardware unit in the proper mode for the
various operations that need to be performed. The interface is coupled to the CTRL_FPGA unit
(or FPGA I/O controller).

For the 3-bit SPACE signal, the data transfers between the Simulation system's computing
environment over the PCI bus and the reconfigurable hardware unit are designated for certain I/O
address spaces in the software/hardware boundary: REG (register), CLK (software clock), S2H
(software to hardware), and H2S (hardware to software). As explained above, the Simulation
system maps the hardware model into four address spaces in main memory according to different
component types and control functions: REG space is designated for the register components;
CLK space is designated for the software clocks; S2H space is designated for the output of the
software test-bench components to the hardware model; and H2S space is designated for the
output of the hardware model to the software test-bench components. These dedicated I/O buffer
spaces are mapped to the kernel's main memory space during system initialization time.

The following Table G provides a description of each of the SPACE signals:



TABLE G: SPACE Signal

SPACE      Description
000        Global (or CLK) space and software to hardware (DMA wr)
001        Register write (DMA wr)
010        Hardware to software (DMA rd)
011        Register read (DMA rd)
100        SRAM write (DMA wr)
101        SRAM read (DMA rd)
110        Unused
111        Unused



The read/write signal on line 1231 indicates whether the data transfer is a read or a write.

The DONE signal on line 1233 indicates the completion of a DMA data transfer period.

The 4-bit COMMAND indicates whether the data transfer operation should be a write,
read, configure new user design into the reconfigurable hardware unit, or interrupt the
simulation. As shown in Table H below, the COMMAND protocol is as follows:

TABLE H: COMMAND Signal

COMMAND    Description
0000       Write into designated space
0001       Read from designated space
0010       Configure FPGA design
0011       Interrupt simulation
0100       Unused
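The SPACE and COMMAND encodings in Tables G and H lend themselves to a pair of enumerations. The sketch below is illustrative only: the member names, the `decode` and `dma_direction` helpers, and the choice to reject any code not listed as used in the tables are assumptions, not part of the described hardware.

```python
from enum import IntEnum


class Space(IntEnum):
    """3-bit SPACE codes from Table G (member names are illustrative)."""
    GLOBAL_S2H_WRITE = 0b000   # global (CLK) space and software to hardware
    REG_WRITE = 0b001
    H2S_READ = 0b010
    REG_READ = 0b011
    SRAM_WRITE = 0b100
    SRAM_READ = 0b101


class Command(IntEnum):
    """4-bit COMMAND codes from Table H (member names are illustrative)."""
    WRITE = 0b0000
    READ = 0b0001
    CONFIGURE = 0b0010
    INTERRUPT = 0b0011


# Spaces that Table G marks as DMA writes; the rest are DMA reads.
DMA_WRITE_SPACES = {Space.GLOBAL_S2H_WRITE, Space.REG_WRITE, Space.SRAM_WRITE}


def dma_direction(space):
    """Return "wr" or "rd" as annotated for each space in Table G."""
    return "wr" if space in DMA_WRITE_SPACES else "rd"


def decode(space_bits, command_bits):
    """Validate raw signal values and return a (Space, Command) pair."""
    if not 0 <= space_bits < 8 or not 0 <= command_bits < 16:
        raise ValueError("SPACE is 3 bits wide and COMMAND is 4 bits wide")
    try:
        return Space(space_bits), Command(command_bits)
    except ValueError:
        raise ValueError("code is unused or not listed in the tables") from None
```

For example, `decode(0b011, 0b0001)` identifies a read command addressed to the register read space.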



The communication handshake protocol will now be discussed with reference to the state
diagram in FIG. 54. At state 1400, the Simulation system at the device driver is idle. As long
as no new command is presented, the system remains idle as indicated by path 1401. When a
new command is presented, the command processor processes the new command at state 1402.
In one embodiment, the command processor is the FPGA I/O controller.

If COMMAND=0000 or COMMAND=0001, the system writes to or reads from the
designated space as indicated by the SPACE index at state 1403. If COMMAND=0010, the
system initially configures the FPGAs in the reconfigurable hardware unit with a user design
or configures the FPGAs with a new user design at state 1404. The system sequences
configuration information for all the FPGAs to model the portion of the user design that can be
modeled into hardware. If, however, COMMAND=0011, the system interrupts the
reconfigurable hardware unit at state 1405 to interrupt the Simulation system because the time
slice has timed out for a new user/process to swap in a new simulation state. At the completion
of these states 1403, 1404, or 1405, the Simulation system proceeds to the DONE state 1406 to
generate the DONE signal, and then returns to state 1400 where it is idle until a new command is
presented.
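The state sequence described for FIG. 54 can be written down as a small dispatch table. This is a sketch for illustration: the state numbers and COMMAND codes come from the text, while the constant names, the `handshake_trace` function, and the use of `None` to model "no new command" are assumptions.

```python
IDLE, PROCESS, ACCESS, CONFIGURE, INTERRUPT, DONE = 1400, 1402, 1403, 1404, 1405, 1406

# COMMAND value -> state entered after the command processor (state 1402).
DISPATCH = {
    0b0000: ACCESS,      # write into designated space
    0b0001: ACCESS,      # read from designated space
    0b0010: CONFIGURE,   # configure FPGA design
    0b0011: INTERRUPT,   # interrupt simulation
}


def handshake_trace(command):
    """Return the sequence of FIG. 54 states visited for one command.

    Passing None models "no new command", so the system stays idle (path 1401).
    """
    if command is None:
        return [IDLE]
    if command not in DISPATCH:
        raise ValueError(f"unused COMMAND code: {command:04b}")
    return [IDLE, PROCESS, DISPATCH[command], DONE, IDLE]
```

Every accepted command passes through the DONE state 1406 before the system returns to idle, which matches the description that the DONE signal is generated at the completion of states 1403, 1404, or 1405.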

The time-sharing feature of the Simulation server for handling multiple jobs with different
levels of priorities will now be discussed. FIG. 52 illustrates one example. Four jobs (job A,
job B, job C, job D) are the incoming jobs in the simulation job queue. However, the priority
levels for these four jobs are different; that is, jobs A and B are assigned high priority I, whereas
jobs C and D are assigned lower priority II. As shown in the time line chart of FIG. 52, the
time-shared reconfigurable hardware unit usage depends on the priority levels of the queued
incoming jobs. At time 1190, the simulation starts with job A given access to the reconfigurable
hardware unit. At time 1191, job A is preempted by job B because job B has the same priority as
job A and the scheduler provides equal time-shared access to the two jobs. Job B now has access
to the reconfigurable hardware unit. At time 1192, job A preempts job B and job A executes to
completion at time 1193. At time 1193, job B takes over and it executes to completion at time
1194. At time 1194, job C, which is next in the queue but with a lower priority level than jobs A
and B, now has access to the reconfigurable hardware unit for execution. At time 1195, job D
preempts job C for time-shared access because it has the same priority level as job C. Job D now
has access until time 1196 where it is preempted by job C. Job C executes to completion at time
1197. Job D then takes over at time 1197 and executes to completion at time 1198.



VIII. MEMORY SIMULATION 

The Memory Simulation or memory mapping aspect of the present invention provides an
effective way for the Simulation system to manage the various memory blocks associated with the
configured hardware model of the user's design, which was programmed into the array of FPGA
chips in the reconfigurable hardware unit. By implementing the embodiments of the present
invention, the memory Simulation scheme does not require any dedicated pins in the FPGA chips
to handle the memory access.

As used herein, the phrase "memory access" refers to either a write access or a read
access between the FPGA logic devices where the user's design is configured and the SRAM
memory devices which store all the memory blocks associated with the user's design. Thus, a
write operation involves data transfer from the FPGA logic devices to the SRAM memory
devices, while a read operation involves data transfer from the SRAM memory devices to the
FPGA logic devices. Referring to FIG. 56, the FPGA logic devices include 1201 (FPGA1),
1202 (FPGA3), 1203 (FPGA0), and 1204 (FPGA2). The SRAM memory devices include
memory devices 1205 and 1206.

Also, the phrase "DMA data transfer" refers to data transfer between the computing
system and the Simulation system, in addition to its common usage among those ordinarily skilled
in the art. The computing system is shown in FIGS. 1, 45, and 46 as the entire PCI-based system
with memory that supports the Simulation system, which resides in software as well as the
reconfigurable hardware unit. Selected device drivers and socket/system calls to/from the operating
system are also part of the Simulation system and allow the proper interface with the operating
system and the reconfigurable hardware unit. In one embodiment of the present invention, a
DMA read transfer involves the transfer of data from the FPGA logic devices (and FPGA SRAM
memory devices for initialization and memory content dump) to the host computing system. A
DMA write transfer involves the transfer of data from the host computing system to the FPGA
logic devices (and FPGA SRAM memory devices for initialization and memory content dump).

The terms "FPGA data bus," "FPGA bus," "FD bus," and variations thereof refer to the
high bank bus FD[63:32] and low bank bus FD[31:0] coupling the FPGA logic devices which
contain the configured and programmed user design to be debugged and the SRAM memory
devices.

The memory Simulation system includes a memory state machine, an evaluation state
machine, and their associated logic to control and interface with: (1) the main computing system
and its associated memory system, (2) the SRAM memory devices coupled to the FPGA buses in
the Simulation system, and (3) the FPGA logic devices which contain the configured and
programmed user design that is being debugged.

The FPGA logic device side of the memory Simulation system includes an evaluation state
machine, an FPGA bus driver, and a logic interface for each memory block N to interface with
the user's own memory interface in the user design to handle: (1) data evaluations among the
FPGA logic devices, and (2) write/read memory access between the FPGA logic devices and the
SRAM memory devices. In conjunction with the FPGA logic device side, the FPGA I/O
controller side includes a memory state machine and interface logic to handle DMA, write, and
read operations between: (1) main computing system and SRAM memory devices, and (2) FPGA
logic devices and the SRAM memory devices.

The operation of the memory Simulation system in accordance with one embodiment of
the present invention is generally as follows. The Simulation write/read cycle is divided into
three periods: DMA data transfer, evaluation, and memory access. The DATAXSFR signal
indicates the occurrence of the DMA data transfer period where the computing system and the
SRAM memory units are transferring data to each other via the FPGA data bus, namely high
bank bus (FD[63:32]) 1212 and low bank bus (FD[31:0]) 1213.

During the evaluation period, logic circuitry in each FPGA logic device generates the
proper software clock, input enable, and mux enable signals to the user's design logic for data
evaluation. Inter-FPGA logic device communication occurs in this period.

During the memory access period, the memory Simulation system waits for the high and
low bank FPGA logic devices to put their respective address and control signals onto their
respective FPGA data buses. These address and control signals are latched in by the
CTRL_FPGA unit. If the operation is a write, then address, control, and data signals are
transported from the FPGA logic devices to their respective SRAM memory devices. If the
operation is a read, then address and control signals are provided to the designated SRAM
memory devices, and data signals are transported from the SRAM memory devices to their
respective FPGA logic devices. After all desired memory blocks in all FPGA logic devices have
been accessed, the memory Simulation write/read cycle is complete and the memory Simulation
system is idle until the onset of the next memory Simulation write/read cycle.
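The direction of data flow during the memory access period can be modeled in a few lines. This is only a behavioral sketch under assumptions: one `Bank` object stands for one SRAM device and the FPGA logic devices on its bank bus, requests are served strictly one after another, uninitialized locations read as zero, and all class and function names are hypothetical. Bus timing, latching by the CTRL_FPGA unit, and the high/low bank split are not modeled.

```python
from dataclasses import dataclass, field

# The three periods of one Simulation write/read cycle, in order.
PERIODS = ("dma_data_transfer", "evaluation", "memory_access")


@dataclass
class MemoryRequest:
    """Address and control (plus data, for writes) placed on the FD bus."""
    fpga: str
    address: int
    write: bool
    data: int = 0


@dataclass
class Bank:
    """One SRAM device shared by the FPGA logic devices on its bank bus."""
    sram: dict = field(default_factory=dict)


def memory_access_period(bank, requests):
    """Serve each request in turn and return {fpga: data} for the reads."""
    read_results = {}
    for req in requests:
        if req.write:
            # Write: data moves from the FPGA logic device to the SRAM.
            bank.sram[req.address] = req.data
        else:
            # Read: data moves from the SRAM back to the requesting FPGA.
            read_results[req.fpga] = bank.sram.get(req.address, 0)
    return read_results
```

For instance, a write by one FPGA logic device followed by a read of the same address by another device on the same bank returns the written value to the reader.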

FIG. 56 shows a high level block diagram of the memory Simulation configuration in 
accordance with one embodiment of the present invention. Signals, connections, and buses that 
are not relevant to the memory Simulation aspect of the present invention are not shown. A 
CTRL_FPGA unit 1200, described above, is coupled to bus 1210 via line 1209. In one 
embodiment, the CTRL_FPGA unit 1200 is a programmable logic device (PLD) in the form of 
an FPGA chip, such as an Altera 10K50 chip. Local bus 1210 allows the CTRL_FPGA unit 
1200 to be coupled to other Simulation array boards (if any) and other chips (e.g., PCI 
controller, EEPROM, clock buffer). Line 1209 carries the DONE signal which indicates the 
completion of a Simulation DMA data transfer period. 

FIG. 56 shows other major functional blocks in the form of logic devices and memory 
devices. In one embodiment, the logic device is a programmable logic device (PLD) in the form 
of an FPGA chip, such as an Altera 10K130 or 10K250 chip. Thus, instead of the embodiment 
shown above with the eight Altera FLEX 10K100 chips in the array, this embodiment uses only 
four chips of Altera's FLEX 10K130. The memory device is a synchronous-pipelined cache 
SRAM, such as a Cypress 128Kx32 CY7C1335 or CY7C1336 chip. The logic devices include 
1201 (FPGA1), 1202 (FPGA3), 1203 (FPGA0), and 1204 (FPGA2). The SRAM chips include 
low bank memory device 1205 (L_SRAM) and high bank memory device 1206 (H_SRAM). 



These logic devices and memory devices are coupled to the CTRL_FPGA unit 1200 via a 
high bank bus 1212 (FD[63:32]) and a low bank bus 1213 (FD[31:0]). Logic devices 1201 
(FPGA1) and 1202 (FPGA3) are coupled to the high bank bus 1212 via bus 1223 and bus 1225, 
respectively, while logic devices 1203 (FPGA0) and 1204 (FPGA2) are coupled to the low bank 
data bus 1213 via bus 1224 and bus 1226, respectively. High bank memory device 1206 is 
coupled to the high bank bus 1212 via bus 1220, while low bank memory device 1205 is coupled 
to the low bank bus 1213 via bus 1219. The dual bank bus structure allows the Simulation system 
to access the devices on the high bank and the devices on the low bank in parallel at improved 
throughput rates. The dual bank data bus structure supports other signals, such as control and 
address signals, so that the Simulation write/read cycles can be controlled. 

Turning briefly to FIG. 61, each Simulation write/read cycle includes a DMA data 
transfer period, an evaluation period, and a memory access period. The combination of the 
various control signals controls and indicates whether the Simulation system is in one period as 
opposed to another. DMA data transfer between the host computer system and the logic devices 
1201 to 1204 in the reconfigurable hardware unit occurs across the PCI bus (e.g., bus 50 in FIG. 
46), the local bus 1210 and 1236, and the FPGA bus 1212 (FD[63:32]) and 1213 (FD[31:0]). 
The memory devices 1205 and 1206 are involved in DMA data transfer for initialization and 
memory content dumps. Evaluation data transfer among the logic devices 1201-1204 in the 
reconfigurable hardware unit occurs across the interconnects (as described above) and the FPGA 
bus 1212 (FD[63:32]) and 1213 (FD[31:0]). Memory access between the logic devices 1201 to 
1204 and the memory devices 1205 and 1206 occurs across the FPGA bus 1212 (FD[63:32]) and 
1213 (FD[31:0]). 

Returning to FIG. 56, the CTRL_FPGA unit 1200 provides and receives many control 
and address signals to control the Simulation write/read cycles. The CTRL_FPGA unit 1200 
provides DATAXSFR and ~EVAL signals on line 1211 to logic devices 1201 and 1203 via line 
1221, respectively, and logic devices 1202 and 1204 via line 1222, respectively. The 
CTRL_FPGA unit 1200 also provides memory address signals MA[18:2] to the low bank 
memory device 1205 and the high bank memory device 1206 via buses 1229 and 1214, 
respectively. To control the mode of these memory devices, the CTRL_FPGA unit 1200 
provides chip select write (and read) signals to the low bank memory device 1205 and the high 
bank memory device 1206 via lines 1216 and 1215, respectively. To indicate the completion of a 
DMA data transfer, the memory Simulation system can send and receive the DONE signal on line 
1209 to the CTRL_FPGA unit 1200 and the computing system. 

As discussed previously with respect to FIGS. 9, 11, 12, 14, and 15, the logic devices 
1201-1204 are connected together by, among other things, the multiplexed cross chip address 
pointer chain represented here in FIG. 56 by the two sets of SHIFTIN/SHIFTOUT lines - lines 
1207, 1227, and 1218, and lines 1208, 1228, and 1217. These sets are initialized at the 
beginning of the chain by Vcc at lines 1207 and 1208. The SHIFTIN signal is sent from the 
preceding FPGA logic device in the bank to start the memory access for the current FPGA logic 
device. At the completion of the shifts through a given chain, the last logic device 
generates a LAST signal (i.e., LASTL or LASTH) to the CTRL_FPGA unit 1200. For the high 
bank, logic device 1202 generates a LASTH shiftout signal on line 1218 to the CTRL_FPGA unit 
1200, and for the low bank, logic device 1204 generates a LASTL signal on line 1217 to the 
CTRL_FPGA unit 1200. 
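By way of illustration only, the SHIFTIN/SHIFTOUT chain of one bank can be modeled in software. The following Python sketch is not part of the described hardware; the function name run_chain, the device labels, and the per-device memory block counts are hypothetical and serve only to show how access passes from one logic device to the next and how the final SHIFTOUT acts as the LAST signal.

```python
def run_chain(devices, blocks_per_device):
    """Walk one bank's chain in order; return (access_order, last_signal).

    devices: device names in chain order, e.g. ["FPGA0", "FPGA2"].
    blocks_per_device: hypothetical mapping of device name -> block count.
    """
    order = []
    shift_in = 1  # the head of the chain is tied to Vcc (logic 1)
    for dev in devices:
        if shift_in != 1:
            break  # a device waits until its predecessor has shifted out
        for block in range(blocks_per_device.get(dev, 0)):
            order.append((dev, block))  # one memory block at a time
        shift_out = 1  # raised once all blocks in this device are done
        shift_in = shift_out  # SHIFTOUT feeds the next device's SHIFTIN
    return order, shift_in  # the final SHIFTOUT serves as LASTL or LASTH
```

For the low bank of FIG. 56, the chain order would be FPGA0 followed by FPGA2, and the returned signal corresponds to LASTL.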

With respect to board implementation and FIG. 56, one embodiment of the present 
invention incorporates the components (e.g., logic devices 1201-1204, memory devices 
1205-1206, and CTRL_FPGA unit 1200) and buses (e.g., FPGA buses 1212-1213 and local bus 1210) 
in one board. This one board is coupled to the motherboard via motherboard connectors. Thus, 
in one board, four logic devices (two on each bank), two memory devices (one on each bank), 
and buses are provided. A second board would contain its own complement of logic devices 
(typically four), memory devices (typically two), FPGA I/O controller (CTRL_FPGA unit) and 
buses. The PCI controller, however, would be installed on the first board only. Inter-board 
connectors, as discussed above, are provided between the boards so that the logic devices in all 
the boards can be connected together and communicate with each other during the evaluation 
period, and the local bus is provided across all the boards. The FPGA buses FD[63:0] are 
provided only within each board and not across multiple boards. 

In this board configuration, the Simulation system performs memory mapping between 
logic devices and memory devices in each board. Memory mapping across different boards is not 
provided. Thus, logic devices in board5 map memory blocks to memory devices in board5 only, 
not to memory devices on other boards. In other embodiments, however, the Simulation system 
maps memory blocks from logic devices on one board to memory devices on another board. 

The operation of the memory Simulation system in accordance with one embodiment of 
the present invention is generally as follows. The Simulation write/read cycle is divided into 
three periods - DMA data transfer, evaluation, and memory access. To indicate the completion 
of a Simulation write/read cycle, the memory Simulation system can send and receive the DONE 
signal on line 1209 to the CTRL_FPGA unit 1200 and the computing system. The DATAXSFR 
signal on bus 1211 indicates the occurrence of the DMA data transfer period where the 
computing system and the FPGA logic devices 1201-1204 are transferring data to each other via 
the FPGA data bus, high bank bus (FD[63:32]) 1212 and low bank bus (FD[31:0]) 1213. In 
general, DMA transfer occurs between the host computing system and the FPGA logic devices. 
For initialization and memory content dump, the DMA transfer is between the host computing 
system and the SRAM memory devices 1205 and 1206. 

During the evaluation period, logic circuitry in each FPGA logic device 1201-1204 
generates the proper software clock, input enable, and mux enable signals to the user's design 
logic for data evaluation. Inter-FPGA logic device communication occurs in this period. The 
CTRL_FPGA unit 1200 also starts an evaluation counter to control the duration of the 
evaluation period. The number of counts, and hence the duration of the evaluation period, is set 
by the system by determining the longest path of the signals. The path length is associated with a 
specific number of steps. The system uses the step information and calculates the number of 
counts necessary to enable the evaluation cycle to run to its completion. 

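The passage above does not state the formula by which step information is converted into counts. As a non-authoritative sketch, assuming the design can be viewed as a directed acyclic graph of signal dependencies and that each step costs a fixed number of counts, the calculation might be expressed in Python as follows. The function names, the graph representation, and the counts_per_step and margin parameters are all assumptions introduced for illustration.

```python
from functools import lru_cache

def longest_path_steps(graph):
    """Return the number of steps (edges) on the longest path of a DAG.

    graph maps each node to the list of nodes it drives (hypothetical model).
    """
    @lru_cache(maxsize=None)
    def depth(node):
        # Steps remaining from this node to the farthest node it reaches.
        return max((1 + depth(nxt) for nxt in graph.get(node, ())), default=0)

    return max((depth(n) for n in graph), default=0)

def eval_counts(graph, counts_per_step=1, margin=0):
    # counts_per_step and margin are illustrative assumptions, not values
    # taken from the text.
    return longest_path_steps(graph) * counts_per_step + margin
```

For example, a design in which signal a drives b and c, and b drives d, has a longest path of two steps, so three counts per step plus one count of margin would yield seven counts.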
During the memory access period, the memory Simulation system waits for the high and 
low bank FPGA logic devices 1201-1204 to put their respective address and control signals onto 
their respective FPGA data buses. These address and control signals are latched in by the 
CTRL_FPGA unit 1200. If the operation is a write, address, control, and data signals are 
transported from the FPGA logic devices 1201-1204 to their respective SRAM memory devices 
1205 and 1206. If the operation is a read, address and control signals are transported from the 
FPGA logic devices 1201-1204 to their respective SRAM memory devices 1205 and 1206, and 
data signals are transported from the SRAM memory devices 1205 and 1206 to their respective 
FPGA logic devices 1201-1204. At the FPGA logic device side, the FD bus driver places the 
address and control signals of a memory block onto the FPGA data bus (FD bus). If the 
operation is a write, the write data is placed on the FD bus for that memory block. If the 
operation is a read, the double buffer latches in the data for the memory block on the FD bus 
from the SRAM memory device. This operation continues for each memory block in each FPGA 
logic device in sequential order, one memory block at a time. When all the desired memory 
blocks in an FPGA logic device have been accessed, the memory Simulation system proceeds to 
the next FPGA logic device in each bank and begins accessing the memory blocks in that FPGA 
logic device. After all desired memory blocks in all FPGA logic devices 1201-1204 have been 
accessed, the memory Simulation write/read cycle is complete and the memory Simulation system 
is idle until the onset of the next memory Simulation write/read cycle. 
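The write and read handling described above can be illustrated with a minimal software model. In this Python sketch the SRAM of one bank is represented by a dictionary, and the names access_block and access_bank, as well as the request tuple layout, are hypothetical. Bus timing, latching, and wait cycles are deliberately not modeled.

```python
def access_block(sram, address, write, data=None):
    """Perform one memory block access against one bank's SRAM (a dict).

    Returns the value that the read double buffer would latch, or None
    for a write.
    """
    if write:
        sram[address] = data  # address, control, and data go to the SRAM
        return None
    return sram.get(address)  # address/control go out; data comes back

def access_bank(sram, requests):
    """Process requests sequentially, one memory block at a time.

    Each request is (address, write) or (address, write, data).
    """
    return [access_block(sram, *req) for req in requests]
```

A write of 0xAB to address 0x10 followed by a read of the same address would return the stored value on the read. In the described system the two banks would each run such a sequence in parallel.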

FIG. 57 shows a more detailed block diagram of the memory Simulation aspect of the 
present invention, including a more detailed structural diagram of the CTRL_FPGA 1200 and 
each logic device that are relevant to memory Simulation. FIG. 57 shows the CTRL_FPGA 1200 
and a portion of the logic device 1203 (which is structurally similar to that of the other logic 
devices 1201, 1202, and 1204). The CTRL_FPGA 1200 includes the memory finite state 
machine (MEMFSM) 1240, AND gate 1241, evaluation (EVAL) counter 1242, a low bank 
memory address/control latch 1243, a low bank address/control multiplexer 1244, address 
counter 1245, a high bank memory address/control latch 1247, and a high bank address/control 
multiplexer 1246. Each logic device, such as logic device 1203 shown here in FIG. 57, includes 
an evaluation finite state machine (EVALFSMx) 1248 and a data bus multiplexer (FDO_MUXx for the 
FPGA0 logic device 1203) 1249. The "x" designation appended to the end of EVALFSM 
identifies the particular logic device (FPGA0, FPGA1, FPGA2, FPGA3) with which it is 
associated, where "x" is a number from 0 to 3 in this example. Thus, EVALFSM0 is associated 
with the FPGA0 logic device 1203. In general, each logic device is associated with some number 
x, and as N logic devices are used, the "x" represents a number from 0 to N-1. 

In each logic device 1201-1204, numerous memory blocks are associated with the 
configured and mapped user design. Thus, memory block interface 1253 in the user's logic 
provides a means for the computing system to access the desired memory block in the array of 
FPGA logic devices. The memory block interface 1253 also provides memory write data on bus 
1295 to the FPGA data bus multiplexer (FDO_MUXx) 1249 and receives memory read data on 
bus 1297 from the memory read data double buffer 1251. 

A memory block data/logic interface 1298 is provided in each FPGA logic device. Each 
of these memory block data/logic interfaces 1298 is coupled to the FPGA data bus multiplexer 
(FDO_MUXx) 1249, the evaluation finite state machine (EVALFSMx) 1248, and the FPGA bus 
FD[63:0]. The memory block data/logic interface 1298 includes a memory read data double 
buffer 1251, the address offset unit 1250, the memory model 1252, and the memory block 
interface for each memory block N (mem_block_N) 1253, which are all repeated in any given 
FPGA logic device 1201-1204 for each memory block N. Thus, for five memory blocks, five 
sets of the memory block data/logic interface 1298 are provided; that is, five sets of the memory 
read data double buffer 1251, the address offset unit 1250, the memory model 1252, and the 
memory block interface for each memory block N (mem_block_N) 1253 are provided. 

Like EVALFSMx, the "x" in FDO_MUXx identifies the particular logic device (FPGA0, 
FPGA1, FPGA2, FPGA3) with which it is associated, where "x" is a number from 0 to 3. The 
output of FDO_MUXx 1249 is provided on bus 1282 which is coupled to the high bank bus 
FD[63:32] or the low bank bus FD[31:0] depending on which chip (FPGA0, FPGA1, FPGA2, 
FPGA3) is associated with the FDO_MUXx 1249. In FIG. 57, FDO_MUXx is FDO_MUX0, 
which is associated with low bank logic device FPGA0 1203. Hence, the output on bus 1282 is 
provided to low bank bus FD[31:0]. Portions of the bus 1283 are used for transporting read data 
from the high bank FD[63:32] or low bank FD[31:0] bus to the read bus 1283 for input to the 
memory read data double buffer 1251. Hence, write data is transported out via FDO_MUX0 
1249 from the memory block in each logic device 1201-1204 to the high bank FD[63:32] or low 
bank FD[31:0] bus, and read data is transported in to the memory read data double buffer 1251 
from the high bank FD[63:32] or low bank FD[31:0] bus via read bus 1283. The memory read 
data double buffer provides a double buffered mechanism that latches data in a first buffer and 
then buffers it again so that the latched data is output at the same time, minimizing skew. This 
memory read data double buffer 1251 will be discussed in more detail below. 

Returning to the memory model 1252, it converts the user's memory type to the memory 
Simulation system's SRAM type. Because the memory type in the user's design can vary from 
one type to another, this memory block interface 1253 can also be unique to the user's design. 
For example, the user's memory type may be DRAM, flash memory, or EEPROM. However, in 
all variations of the memory block interface 1253, memory addresses and control signals (e.g., 
read, write, chip select, mem_clk) are provided. One embodiment of the memory Simulation 
aspect of the present invention converts the user's memory type to the SRAM type used in the 
memory Simulation system. If the user's memory type is SRAM, the conversion to an SRAM 
type memory model is quite simple. Thus, memory addresses and control signals are provided 
on bus 1296 to the memory model 1252, which performs the conversion. 

The memory model 1252 provides memory block address information on bus 1293 and 
control information on bus 1292. Address offset unit 1250 receives address information for the 
various memory blocks and provides a modified offset address on bus 1291 from the original 
address on bus 1293. The offset is necessary because certain memory blocks' addresses may 
overlap each other. For example, one memory block may use and reside in space 0-2K, whereas 
another memory block may use and reside in space 0-3K. Because both memory blocks overlap 
in space 0-2K, individual addressing may be difficult without some sort of address offsetting 
mechanism. Thus, the first memory block may use and reside in space 0-2K, while the second 
memory block may use and reside in the space above 2K and up to 5K. The offset addresses 
from address offset unit 1250 and the control signals on bus 1292 are combined and provided on 
bus 1299 to the FPGA bus multiplexer (FDO_MUXx) 1249. 
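The 0-2K and 0-3K example above can be worked through in a few lines of Python. This is a sketch under the assumption that blocks are simply packed back to back in the order given; the text provides only the example, not the packing policy, and the names assign_offsets and offset_address are hypothetical.

```python
def assign_offsets(block_sizes):
    """Pack memory blocks back to back; return the base offset of each."""
    offsets, base = [], 0
    for size in block_sizes:
        offsets.append(base)  # this block starts where the previous ended
        base += size
    return offsets

def offset_address(original_address, block_index, offsets):
    """Translate a block-local address into the shared SRAM address."""
    return offsets[block_index] + original_address

K = 1024
offsets = assign_offsets([2 * K, 3 * K])  # first block 0-2K, second block 0-3K
```

With these sizes the first block keeps base 0 and the second block receives base 2K, so its local addresses 0 through 3K-1 map to 2K through 5K-1, matching the example in the text.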

ijl The FPGA data bus multiplexer FDOJMUXx receives SPACE2 data on bus 1289, 

Ti SPACE3 data on bus 1290, address/control signals on bus 1299, and memory write data on bus 
^ 1295. As described previously, SPACE2 and SPACE3 are specific space indices. The SPACE 
^ 15 index, which is generated by the FPGA I/O controller (item 327 in FIG. 10; FIG. 22), selects the 
gS particular address space (i.e., REG read, REG write, S2H read, H2S write, and CLK write), 
tn Within this address space, the system of the present invention sequentially selects the particular 
p word to be accessed. SPACE2 refers to the memory space dedicated for the DMA read transfer 
for the hardware-to-software H2S data. SPACES refers to the memory space dedicated for the 
20 DMA read transfer for REGISTER READ data. Refer to Table G above. 

As its output, FDO_MUXx 1249 provides data on bus 1282 to either the low bank or high 
bank bus. The selector signals are the output enable (output_en) signal on line 1284 and the 
select signal on line 1285 from the EVALFSMx unit 1248. The output enable signal on line 1284 
enables (or disables) the operation of the FDO_MUXx 1249. For data accesses across the FPGA 
bus, the output enable signal is enabled to allow the FDO_MUXx to function. The select signal 
on line 1285 is generated by the EVALFSMx unit 1248 to select among the plurality of inputs 
from the SPACE2 data on bus 1289, SPACE3 data on bus 1290, address/control signals on bus 
1299, and memory write data on bus 1295. The generation of the select signal by the 
EVALFSMx unit 1248 will be discussed further below. 

The EVALFSMx unit 1248 is at the operational core of each logic device 1201-1204 with 
respect to the memory Simulation system. The EVALFSMx unit 1248 receives as its inputs the 
SHIFTIN signal on line 1279, the EVAL signal from the CTRL_FPGA unit 1200 on line 1274, 
and a write signal wrx on line 1287. The EVALFSMx unit 1248 outputs the SHIFTOUT signal 
on line 1280, the read latch signal rd_latx on line 1286 to the memory read data double buffer 
1251, the output enable signal on line 1284 to the FDO_MUXx 1249, the select signal on line 
1285 to the FDO_MUXx 1249, and three signals to the user's logic (input_en, mux_en, and 
clk_en) on lines 1281. 
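The select and output enable behavior of FDO_MUXx described above can be sketched as a four-input multiplexer. In this Python illustration the numeric encoding of the select signal (0 through 3) and the use of None to represent an undriven bus are assumptions; the text identifies the four inputs and the two control signals but not their encoding.

```python
def fdo_mux(select, output_en, space2, space3, addr_ctrl, write_data):
    """Return the value driven onto bus 1282, or None when not driving."""
    if not output_en:
        return None  # output disabled: the multiplexer does not drive the bus
    # Hypothetical select encoding; the real encoding is not given here.
    inputs = {0: space2, 1: space3, 2: addr_ctrl, 3: write_data}
    return inputs[select]
```

In a memory write, the EVALFSMx unit would first select the address/control input and then the memory write data input, with the output enable asserted in both cases.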

The operation of the FPGA logic devices 1201-1204 for the memory Simulation system in 
accordance with one embodiment of the present invention is generally as follows. When the 
EVAL signal is at logic 1, data evaluation within the FPGA logic devices 1201-1204 takes place; 
otherwise, the Simulation system is performing either DMA data transfer or memory access. At 
EVAL=1, the EVALFSMx unit 1248 generates the clk_en signal, the input_en signal, and the 
mux_en signal to allow the user's logic to evaluate the data, latch relevant data, and multiplex 
signals across logic devices, respectively. The EVALFSMx unit 1248 generates the clk_en signal 
to enable the second flip-flop of all the clock edge register flip-flops in the user's design logic 
(see FIG. 19). The clk_en signal is otherwise known as the software clock. If the user's 
memory type is synchronous, clk_en also enables the second clock of the memory read data 
double buffer 1251 in each memory block. The EVALFSMx unit 1248 generates the input_en 
signal to the user's design logic to latch the input signals sent from the CPU by DMA transfer to 
the user's logic. The input_en signal provides the enable input to the second flip-flop in the 
primary clock register (see FIG. 19). Finally, the EVALFSMx unit 1248 generates the mux_en 
signal to turn on the multiplexing circuit in each FPGA logic device to start the communication 
with other FPGA logic devices in the array. 

Thereafter, if the FPGA logic devices 1201-1204 contain at least one memory block, the 
memory Simulation system waits for the selected data to be shifted in to the selected FPGA logic 
device and then generates the output_en and select signals for the FPGA data bus driver to put the 
address and control signals of the memory block interface 1253 (mem_block_N) on the FD bus. 
If the write signal wrx on line 1287 is enabled (i.e., logic 1), then the select and output_en 
signals are enabled to place the write data onto either the low or high bank bus, depending on 
the bank to which the FPGA chip is coupled. In FIG. 57, logic device 1203 is FPGA0 and is coupled 
to the low bank bus FD[31:0]. If the write signal wrx on line 1287 is disabled (i.e., logic 0), then 
the select and output_en signals are disabled and the read latch signal rd_latx on line 1286 is 
asserted to let the memory read data double buffer 1251 latch and double buffer the selected data 
from the SRAM via either the low or high bank bus, depending on the bank to which the FPGA 
chip is coupled. The wrx signal is the memory write signal which is derived from the memory 
interface of the user's design logic. Indeed, the wrx signal on line 1287 comes from memory 
model 1252 via control bus 1292. 

This process of reading or writing data occurs for each FPGA logic device. After all 
memory blocks have been processed via SRAM access, the EVALFSMx unit 1248 generates the 
SHIFTOUT signal to allow SRAM access by the next FPGA logic device in the chain. Note that 
the memory accesses for the devices on the high and low banks occur in parallel. At times, the 
memory access for one bank may complete before the memory access for the other bank. For all 
of these accesses, appropriate wait cycles are inserted so that logic processes data only when it is 
ready and data is available. 

On the CTRL_FPGA unit 1200 side, the MEMFSM 1240 is at the core of the memory 
Simulation aspect of the present invention. It sends and receives many control signals to control 
the activation of the memory Simulation write/read cycles and the control of the various 
operations supported by the cycles. The MEMFSM 1240 receives the DATAXSFR signal on line 
1260 via line 1258. This signal is also provided to each logic device on line 1273. When 
DATAXSFR goes low (i.e., logic low), the DMA data transfer period ends and the evaluation 
and memory access periods begin. 

The MEMFSM 1240 also receives a LASTH signal on line 1254 and a LASTL signal on 
line 1255 to indicate that the selected word associated with the selected address space has been 
accessed between the computing system and the Simulation system via the PCI bus and the FPGA 
bus. The MOVE signal associated with this shift out process is propagated through each logic 
device (e.g., logic devices 1201-1204) until the desired word has been accessed and the MOVE 
signal ultimately becomes the LAST signal (i.e., LASTH for the high bank and LASTL for the 
low bank) at the end of the chain. In the EVALFSM 1248 (i.e., FIG. 57 shows the EVALFSM0 
for the FPGA0 logic device 1203), the corresponding LAST signal is the SHIFTOUT signal on 
line 1280. Because the particular logic device 1203 is not the last logic device in the low bank 
chain as shown in FIG. 56, where logic device 1204 is the last logic device in the low bank chain, 
the SHIFTOUT signal for EVALFSM0 is not the LAST signal. If the EVALFSM 1248 
corresponds to EVALFSM2 in FIG. 56, then the SHIFTOUT signal on line 1280 is the LASTL 
signal provided to line 1255 to the MEMFSM. Otherwise, the SHIFTOUT signal on line 1280 is 
provided to logic device 1204 (see FIG. 56). Similarly, the SHIFTIN signal on line 1279 
represents Vcc for the FPGA0 logic device 1203 (see FIG. 56). 

The LASTL and LASTH signals are input to AND gate 1241 via lines 1256 and 1257, 
respectively. AND gate 1241 provides an open drain. The output of the AND gate 1241 
generates the DONE signal on line 1259, which is provided to the computing system and the 
MEMFSM 1240. Thus, only when both the LASTL and LASTH signals are logic high to 
indicate the end of the shifted out chain process will the AND gate output a logic high. 

The MEMFSM 1240 generates a start signal on line 1261 to the EVAL counter 1242. As 
the name implies, the start signal triggers the start of the EVAL counter 1242 and is sent after the 
completion of the DMA data transfer period. The start signal is generated upon the detection of a 
high to low (1 to 0) transition of the DATAXSFR signal. The EVAL counter 1242 is a 
programmable counter that counts a predetermined number of clock cycles. The duration of the 
programmed counts in the EVAL counter 1242 determines the duration of the evaluation period. 
The output of the EVAL counter 1242 on line 1274 is either a logic level 1 or 0 depending on 
whether the counter is counting or not. When the EVAL counter 1242 is counting, the output on 
line 1274 is at logic 1, which is provided to each FPGA logic device 1201-1204 via EVALFSMx 
1248. When EVAL=1, the FPGA logic devices 1201-1204 perform inter-FPGA communication 
to evaluate data in the user's design. The output of the EVAL counter 1242 is also fed back on 
line 1262 to the MEMFSM unit 1240 for its own tracking purposes. At the end of the 
programmed counts, the EVAL counter 1242 generates a logic 0 signal on lines 1274 and 1262 to 
indicate the end of the evaluation period. 
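The behavior of the EVAL counter described above can be illustrated with a cycle-by-cycle model. This Python sketch assumes that DATAXSFR is sampled once per clock cycle and that EVAL rises in the same cycle in which the high to low transition is detected; that timing detail and the name eval_waveform are assumptions made for illustration.

```python
def eval_waveform(dataxsfr, counts):
    """Return the EVAL level for each clock cycle of the DATAXSFR samples."""
    eval_out, remaining, prev = [], 0, 0
    for level in dataxsfr:
        if prev == 1 and level == 0:
            remaining = counts  # start signal: 1-to-0 transition detected
        eval_out.append(1 if remaining > 0 else 0)  # 1 while counting
        if remaining > 0:
            remaining -= 1
        prev = level
    return eval_out
```

With DATAXSFR samples of 1, 1, 0, 0, 0, 0, 0 and a programmed count of three, EVAL is at logic 1 for exactly three cycles beginning at the transition and returns to logic 0 afterward.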

If memory access is not desired, the MEM_EN signal on line 1272 is asserted at logic 0 
and provided to the MEMFSM unit 1240, in which case the memory Simulation system waits for 
another DMA data transfer period. If memory access is desired, the MEM_EN signal on line 
1272 is asserted at logic 1. In essence, the MEM_EN signal is a control signal from the CPU to 
enable the on-board SRAM memory device for accessing the FPGA logic devices. Here, the 
MEMFSM unit 1240 waits for the FPGA logic devices 1201-1204 to place the address and 
control signals on the FPGA bus, FD[63:32] and FD[31:0]. 
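Taken together, the DATAXSFR, EVAL, MEM_EN, and DONE signals determine which period of the write/read cycle is active. The following Python sketch expresses that sequencing as a next-state function. It is a simplified reading of the text rather than the MEMFSM itself: the state names, including the "idle" state, and the function name next_period are hypothetical.

```python
def next_period(period, dataxsfr, eval_active, mem_en, done):
    """Return the next period of the cycle given the current control signals."""
    if period == "dma":
        # The DMA period ends when DATAXSFR goes low.
        return "dma" if dataxsfr else "evaluation"
    if period == "evaluation":
        if eval_active:
            return "evaluation"  # EVAL counter is still counting
        # MEM_EN = 0 skips memory access and waits for the next DMA period.
        return "memory_access" if mem_en else "idle"
    if period == "memory_access":
        # DONE (LASTL AND LASTH) marks the end of the cycle.
        return "idle" if done else "memory_access"
    return "dma" if dataxsfr else "idle"  # idle until the next DMA transfer
```

Stepping this function once per clock with sampled signal values traces a cycle from DMA data transfer through evaluation and, when enabled, memory access.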

The remainder of the functional units and their associated control signals and lines are for 
providing address/control information to the SRAM memory devices for writing and reading 
data. These units include the memory address/control latch 1243 for the low bank, the 
address/control mux 1244 for the low bank, the memory address/control latch 1247 for the high 
bank, the address/control mux 1246 for the high bank, and the address counter 1245. 

The memory address/control latch 1243 for the low bank receives address and control 
signals from the FPGA bus FD[31:0] 1275, which coincides with bus 1213, and a latch signal on 
line 1263. The latch 1243 generates the mem_wr_L signal on line 1264 and provides the incoming 
address/control signals from FPGA bus FD[31:0] to the address/control mux 1244 via bus 1266. 
This mem_wr signal is the same as the chip select write signal. 

The address/control mux 1244 receives as inputs the address and control information on 
bus 1266 and the address information from address counter 1245 via bus 1268. As output, it 
sends address/control information on bus 1276 to the low bank SRAM memory device 1205. The 
select signal on line 1265 provides the proper selection signal from the MEMFSM unit 1240. 
The address/control information on bus 1276 corresponds to the MA[18:2] and chip select 
read/write signals on buses 1229 and 1216 in FIG. 56. 

The address counter 1245 receives information from SPACE4 and SPACE5 via bus 1267. 
SPACE4 includes the DMA write transfer information. SPACE5 includes the DMA read 
transfer information. Thus, these DMA transfers occur between the computing system 
(cache/main memory via the workstation CPU) and the Simulation system (SRAM memory 
devices 1205, 1206) across the PCI bus. The address counter 1245 provides its output to bus 
1288 and 1268 to address/control muxes 1244 and 1246. With the appropriate select signal on 
line 1265 for the low bank, the address/control mux 1244 places on bus 1276 either the 
address/control information on bus 1266 for write/read memory access between the SRAM 
device 1205 and the FPGA logic devices 1203, 1204, or alternatively, the DMA write/read 
transfer data from SPACE4 or SPACE5 on bus 1267. 

During the memory access period, the MEMFSM unit 1240 provides the latch signal on 
line 1263 to the memory address/control latch 1243 to fetch the inputs from the FPGA bus 
FD[31:0]. The MEMFSM unit 1240 extracts the mem_wr_L control information from the 
address/control signals on FD[31:0] for further control. If the mem_wr_L signal on line 1264 is 
a logic 1, a write operation is desired and the appropriate select signal on line 1265 is generated 
by the MEMFSM unit 1240 to the address/control mux 1244 so that the address and control 
signals on bus 1266 are sent to the low bank SRAM on bus 1276. Thereafter, a write data 
transfer occurs from the FPGA logic devices to the SRAM memory devices. If the mem_wr_L 
signal on line 1264 is a logic 0, a read operation is desired so the Simulation system waits for 
data on the FPGA bus FD[31:0] placed there by the SRAM memory device. As soon as data is 
ready, the read data transfer occurs from the SRAM memory devices to the FPGA logic devices. 

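The low bank address path through latch 1243 and mux 1244 can be summarized in a short model. In this Python sketch the AddrCtrl record stands in for whatever the latch captures from FD[31:0]; the actual bit layout is not given in this passage, so the field names, the dma_active flag, and the function name low_bank_step are all assumptions.

```python
from collections import namedtuple

# Hypothetical view of the latched address/control word; the real bit
# layout on FD[31:0] is not specified in this passage.
AddrCtrl = namedtuple("AddrCtrl", ["address", "mem_wr"])

def low_bank_step(fd_bus_value, dma_active, counter_address):
    """Return (address sent to the low bank SRAM, operation) for one step."""
    latched = fd_bus_value  # latch 1243 captures the FD bus value
    if dma_active:
        # Mux 1244 selects the address counter 1245 for DMA transfers.
        return counter_address, "dma"
    # Otherwise mux 1244 passes the latched address; mem_wr_L picks the mode.
    operation = "write" if latched.mem_wr == 1 else "read"
    return latched.address, operation
```

The high bank path through latch 1247 and mux 1246 would follow the same pattern with mem_wr_H.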
A similar configuration and operation for the high bank are provided. The memory 
address/control latch 1247 for the high bank receives address and control signals from the FPGA 
bus FD[63:32] 1278, which coincides with bus 1212, and a latch signal on line 1270. The latch 
1247 generates the mem_wr_H signal on line 1271 and provides the incoming address/control signals 
from FPGA bus FD[63:32] to the address/control mux 1246 via bus 1239. 

The address/control mux 1246 receives as inputs the address and control information on 
bus 1239 and the address information from address counter 1245 via bus 1268. As output, it 
sends address/control information on bus 1277 to the high bank SRAM memory device 1206. 
The select signal on line 1269 provides the proper selection signal from the MEMFSM unit 1240. 
The address/control information on bus 1277 corresponds to the MA[18:2] and chip select 
read/write signals on buses 1214 and 1215 in FIG. 56. 

The address counter 1245 receives information from SPACE4 and SPACE5 via bus 1267 
as mentioned above for DMA write and read transfers. The address counter 1245 provides its 
output to bus 1288 and 1268 to address/control muxes 1244 and 1246. With the appropriate select 
signal on line 1269 for the high bank, the address/control mux 1246 places on bus 1277 either the 
address/control information on bus 1239 for write/read memory access between the SRAM 
device 1206 and the FPGA logic devices 1201, 1202, or alternatively, the DMA write/read 
transfer data from SPACE4 or SPACE5 on bus 1267. 

During the memory access period, the MEMFSM unit 1240 provides the latch signal on 
line 1270 to the memory address/control latch 1247 to fetch the inputs from the FPGA bus 
FD[63:32]. The MEMFSM unit 1240 extracts the mem_wr_H control information from the 
address/control signals on FD[63:32] for further control. If the mem_wr_H signal on line 1271 
is a logic 1, a write operation is desired and the appropriate select signal on line 1269 is 
generated by the MEMFSM unit 1240 to the address/control mux 1246 so that the address and 
control signals on bus 1239 are sent to the high bank SRAM on bus 1277. Thereafter, a write 
data transfer occurs from the FPGA logic devices to the SRAM memory devices. If the 
mem_wr_H signal on line 1271 is a logic 0, a read operation is desired so the Simulation system 
waits for data on the FPGA bus FD[63:32] placed there by the SRAM memory device. As soon 
as data is ready, the read data transfer occurs from the SRAM memory devices to the FPGA logic 
devices. 

As shown in FIG. 57, address and control signals are provided to the low bank SRAM
memory device and the high bank memory device via bus 1276 and 1277, respectively. The bus
1276 for the low bank corresponds to the combination of the buses 1229 and 1216 in FIG. 56.
Similarly, the bus 1277 for the high bank corresponds to the combination of the buses 1214 and
1215 in FIG. 56.

The operation of the CTRL_FPGA unit 1200 for the memory Simulation system in
accordance with one embodiment of the present invention is generally as follows. The DONE
signal on line 1259, which is provided to the computing system and the MEMFSM unit 1240 in
the CTRL_FPGA unit 1200, indicates the completion of a Simulation write/read cycle. The
DATAXSFR signal on line 1260 indicates the occurrence of the DMA data transfer period of the
Simulation write/read cycle. Memory address/control signals on both of the FPGA buses FD[31:0]
and FD[63:32] are provided to the memory address/control latches 1243 and 1247 for the low and
high banks, respectively. For either bank, MEMFSM unit 1240 generates the latch signal (1263
or 1270) to latch the address and control information. This information is then provided to the
SRAM memory devices. The mem_wr signal is used to determine if a write or a read operation
is desired. If a write is desired, data is transferred from the FPGA logic devices 1201-1204 to
the SRAM memory devices via the FPGA bus. If a read is desired, the Simulation system waits
for the SRAM memory device to put the requested data onto the FPGA bus for transfer from
the SRAM memory device to the FPGA logic devices. For DMA data transfers of SPACE4 and
SPACE5, the select signal on lines 1265, 1269 can select the output of the address counter 1245
as the data to be transferred between the main computing system and the SRAM memory devices
in the Simulation system. For all of these accesses, appropriate wait cycles are inserted so that
logic processes data only when it is ready and data is available.

FIG. 60 shows a more detailed view of the memory read data double buffer 1251 (FIG.
57). Each memory block N in each FPGA logic device has a double buffer to latch in the
relevant data which may be coming in at different times, and then finally buffer out this
relevant latched data at the same time. In FIG. 60, double buffer 1391 for memory block 0
includes two D-type flip-flops 1340 and 1341. The output 1343 of the first D flip-flop 1340 is
coupled to the input of the second D flip-flop 1341. The output 1344 of the second D flip-flop
1341 is the output of the double buffer, which is provided to the memory block N interface in the
user's design logic. The global clock input is provided to the first flip-flop 1340 on line 1393
and the second flip-flop 1341 on line 1394.

The first D flip-flop 1340 receives on line 1342 its data input from the SRAM memory 
devices via bus 1283 and the FPGA bus FD[63:32] for the high bank and FD[31:0] for the low
bank. The enable input is coupled to line 1345 which receives the rd_latx (e.g., rd_lat0) signal
from the EVALFSMx unit for each FPGA logic device. Thus, for read operations (i.e., wrx=0),
the EVALFSMx unit generates the rd_latx signal to latch in the data on line 1342 to line 1343.
Because the input data for all the double buffers of all memory blocks may come in at different
times, the double buffer ensures that all of the data is latched in first. Once all the data is latched in to D
flip-flop 1340, the clk_en signal (i.e., the software clock) is provided on line 1346 as the clock
input to the second D flip-flop 1341. When the clk_en signal is asserted, the latched data on line
1343 is buffered into D flip-flop 1341 to line 1344.
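
The two-stage latching described above can be modeled behaviorally as shown below. This is a sketch under stated assumptions, not the circuit of FIG. 60: the class name DoubleBuffer is hypothetical, and both rd_latx and clk_en are modeled as enables sampled on a common global clock edge.

```python
class DoubleBuffer:
    """Behavioral model of one memory read data double buffer (assumed)."""

    def __init__(self):
        self.stage1 = 0  # output of the first flip-flop (line 1343)
        self.stage2 = 0  # output of the second flip-flop (line 1344)

    def clock(self, data_in, rd_lat, clk_en):
        """Advance one global clock edge and return the buffer output."""
        # Both flip-flops sample their inputs on the same edge, so the
        # second stage sees the first stage's value from before this edge.
        next_stage2 = self.stage1 if clk_en else self.stage2
        next_stage1 = data_in if rd_lat else self.stage1
        self.stage1, self.stage2 = next_stage1, next_stage2
        return self.stage2


# Two memory blocks receive SRAM data on different cycles.
buf0, buf1 = DoubleBuffer(), DoubleBuffer()
buf0.clock(0xA, rd_lat=True, clk_en=False)   # block 0 data arrives first
buf1.clock(0x0, rd_lat=False, clk_en=False)
buf0.clock(0x0, rd_lat=False, clk_en=False)
buf1.clock(0xB, rd_lat=True, clk_en=False)   # block 1 data arrives later
# A single clk_en pulse then presents both values at the same time.
buf0.clock(0x0, rd_lat=False, clk_en=True)
buf1.clock(0x0, rd_lat=False, clk_en=True)
```

In this model the outputs of both buffers stay unchanged until clk_en is asserted, regardless of when each first stage was loaded.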

For the next memory block 1, another double buffer 1392 substantially equivalent to
double buffer 1391 is provided. The data from the SRAM memory devices are input on line
1396. The global clock signal is input on line 1397. The clk_en (software clock) signal is input
to the second flip-flop (not shown) in the double buffer 1392 on line 1398. These lines are
coupled to analogous signal lines for the first double buffer 1391 for memory block 0 and all
other double buffers for other memory blocks N. The output double buffered data is provided on
line 1399.

The rd_latx signal (e.g., rd_lat1) for the second double buffer 1392 is provided on line
1395 separately from other rd_latx signals for other double buffers. More double buffers are
provided for other memory blocks N.

The state diagram of the MEMFSM unit 1240 will now be discussed in accordance with
one embodiment of the present invention. FIG. 58 shows such a state diagram of the finite state
machine of the MEMFSM unit in the CTRL_FPGA unit. The state diagram in FIG. 58 has been
structured so that the three periods within the Simulation write/read cycle are also shown with
their corresponding states. Thus, states 1300-1301 correspond to the DMA data transfer period;
states 1302-1304 correspond to the evaluation period; and states 1305-1314 correspond to the
memory access period. Refer to FIG. 57 in conjunction with FIG. 58 in the discussion below.

Generally, the sequence of signals for the DMA transfer, evaluation, and memory access
is set. In one embodiment, the sequence is as follows: DATA_XSFR triggers the DMA data
transfer, if any. The LAST signals for both high and low banks are generated at the completion
of the DMA data transfer and trigger the DONE signal to indicate the
completion of the DMA data transfer period. The XSFR_DONE signal is then generated and the
EVAL cycle then begins. At the conclusion of EVAL, memory read/write can begin.

Turning to the top of FIG. 58, state 1300 is idle whenever the DATAXSFR signal is at 
logic 0. This indicates that no DMA data transfers are occurring at the moment. When the 
DATAXSFR signal is at logic 1, the MEMFSM unit 1240 proceeds to state 1301. Here, the
computing system requires DMA data transfer between the computing system (main memory in 
FIGS. 1, 45, and 46) and the Simulation system (FPGA logic devices 1201-1204 or SRAM 
memory device 1205, 1206 in FIG. 56). Appropriate wait cycles are inserted until the DMA data 
transfer is complete. When the DMA transfer has completed, the DATAXSFR signal returns to
logic 0.

When the DATAXSFR signal returns to logic 0, the generation of the start signal is 
triggered in the MEMFSM unit 1240 at state 1302. The start signal starts the EVAL counter 
1242, which is a programmable counter. The duration of the programmed counts in the EVAL 
counter is equivalent to the duration of the evaluation period. So long as the EVAL counter is
counting at state 1303, the EVAL signal is asserted at logic 1 and provided to the EVALFSMx in 
each FPGA logic device as well as the MEMFSM unit 1240. At the end of the count, the EVAL 
counter presents the EVAL signal at logic 0 to the EVALFSMx in each FPGA logic device and 
the MEMFSM unit 1240. When the MEMFSM unit 1240 receives the logic 0 EVAL signal, it 
turns on the EVAL_DONE flag at state 1304. The EVAL_DONE flag is used by MEMFSM to
indicate that the evaluation period has ended and the memory access period, if desired, can now
proceed. The CPU will check the EVAL_DONE and XSFR_DONE by reading the
XSFR_EVAL register (see Table K below) to confirm that DMA transfer and EVAL have
completed successfully before starting the next DMA transfer.
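
The behavior of the programmable EVAL counter 1242 can be illustrated with a short model. This is a sketch only; the function name eval_signal and the one-value-per-clock representation are assumptions made for illustration.

```python
def eval_signal(programmed_count):
    """Yield the EVAL level for each clock cycle after the start signal.

    EVAL is held at logic 1 for the programmed number of counts and
    then presented at logic 0, at which point the MEMFSM would turn on
    the EVAL_DONE flag (state 1304).
    """
    for _ in range(programmed_count):
        yield 1  # counter still counting (state 1303): EVAL asserted
    yield 0      # end of count: EVAL deasserted


# A counter programmed for three counts produces three cycles of EVAL=1.
levels = list(eval_signal(3))
```

Changing the programmed count changes only the length of the run of logic 1 values, which corresponds to the duration of the evaluation period.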

However, in some cases, the Simulation system may not want to perform memory access
at the moment. Here, the Simulation system keeps the memory enable signal MEM_EN at logic
0. This disabled (logic 0) MEM_EN signal keeps the MEMFSM unit at idle state 1300, where it
is waiting for DMA data transfer or evaluation of data by the FPGA logic devices. On the other
hand, if the memory enable signal MEM_EN is at logic 1, the Simulation system is indicating the
desire to conduct memory access.

Below state 1304 in FIG. 58, the state diagram is divided into two sections which proceed 
in parallel. One section contains states 1305, 1306, 1307, 1308, and 1309 for the low bank 
memory access. The other section contains states 1311, 1312, 1313, 1314, and 1309 for the high 
bank memory access. 

At state 1305, the Simulation system waits one cycle for the currently selected FPGA
logic device to place the address and control signals on the FPGA bus FD[31:0]. At state 1306,
the MEMFSM generates the latch signal on line 1263 to the memory address/control latch 1243
to fetch inputs from the FD[31:0]. The data corresponding to this particular fetched address and
control signal will either be read from the SRAM memory device or written to the SRAM
memory device. To determine if the Simulation system requires a read operation or a write
operation, the memory write signal mem_wr_L for the low bank will be extracted from the
address and control signals. If mem_wr_L=0, a read operation is requested. If mem_wr_L=1,
then a write operation is requested. As stated previously, this mem_wr signal is equivalent to
the chip select write signal.

At state 1307, the proper select signal for the address/control mux 1244 is generated to 
send address and control signals to the low bank SRAM. The MEMFSM unit checks the 
mem_wr signal and the LASTL signal. If mem_wr_L= 1 and LASTL=0, a write operation is 
requested but the last data in the chain of FPGA logic devices has not been shifted out yet. Thus, 
the Simulation system returns to state 1305 where it waits one cycle for the FPGA logic device to
put more address and control signals on FD[31:0]. This process continues until the last data has 
been shifted out of the FPGA logic devices. If, however, mem_wr_L= 1 and LASTL = 1, the last 
data has been shifted out of the FPGA logic devices. 

Similarly, if mem_wr_L=0 indicating a read operation, the MEMFSM proceeds to state 
1308. At state 1308, the Simulation system waits one cycle for the SRAM memory device to put
the data onto the FPGA bus FD[31:0]. If LASTL=0, the last data in the chain of FPGA logic
devices has not been shifted out yet. Thus, the Simulation system returns to state 1305 where it
waits one cycle for the FPGA logic device to put more address and control signals on FD[31:0].
This process continues until the last data has been shifted out of the FPGA logic devices. Note
that write operations (mem_wr_L=1) and read operations (mem_wr_L=0) can be interleaved or
otherwise alternate until LASTL=1.

When LASTL= 1, the MEMFSM proceeds to state 1309 where it waits while DONE=0. 
When DONE = 1, both LASTL and LASTH are at logic 1 and thus, the Simulation write/read 
cycle has completed. The Simulation system then proceeds to state 1300 where it remains idle 
whenever DATAXSFR=0.
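
The low bank memory access states described above (1305 through 1309) can be summarized as a next-state function. This is a minimal sketch that follows the transitions given in the text; the function name memfsm_low_bank_step and its argument names are hypothetical, and the earlier DMA and evaluation states are omitted.

```python
def memfsm_low_bank_step(state, mem_wr_l=0, lastl=0, done=0):
    """Return the next MEMFSM state for the low bank memory access period."""
    if state == 1305:
        # Wait one cycle for the FPGA to drive address/control on FD[31:0].
        return 1306
    if state == 1306:
        # Latch address/control (line 1263) and extract mem_wr_L.
        return 1307
    if state == 1307:
        if mem_wr_l == 0:
            return 1308              # read: wait for SRAM data
        return 1309 if lastl else 1305  # write: finish or fetch next access
    if state == 1308:
        return 1309 if lastl else 1305  # read: finish or fetch next access
    if state == 1309:
        return 1300 if done else 1309   # wait for DONE, then go idle
    raise ValueError(f"state {state} is not part of the low bank access path")
```

The high bank follows the same pattern with states 1311 through 1314, mem_wr_H, and LASTH in place of the low bank names.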

The same process is applicable for the high bank. At state 1311, the Simulation system 
waits one cycle for the currently selected FPGA logic device to place the address and control 
signals on the FPGA bus FD[63:32]. At state 1312, the MEMFSM generates the latch signal on 
line 1270 to the memory address/control latch 1247 to fetch inputs from the FD[63:32]. The data
corresponding to this particular fetched address and control signal will either be read from the
SRAM memory device or written to the SRAM memory device. To determine if the Simulation
system requires a read operation or a write operation, the memory write signal mem_wr_H for
the high bank will be extracted from the address and control signals. If mem_wr_H=0, a read
operation is requested. If mem_wr_H=1, then a write operation is requested.

At state 1313, the proper select signal for the address/control mux 1246 is generated to 
send address and control signals to the high bank SRAM. The MEMFSM unit checks the 
mem_wr signal and the LASTH signal. If mem_wr_H=1 and LASTH=0, a write operation is
requested but the last data in the chain of FPGA logic devices has not been shifted out yet. Thus,
the Simulation system returns to state 1311 where it waits one cycle for the FPGA logic device to
put more address and control signals on FD[63:32]. This process continues until the last data has
been shifted out of the FPGA logic devices. If, however, mem_wr_H=1 and LASTH=1, the
last data has been shifted out of the FPGA logic devices.

Similarly, if mem_wr_H=0 indicating a read operation, the MEMFSM proceeds to state 
1314. At state 1314, the Simulation system waits one cycle for the SRAM memory device to put
the data onto the FPGA bus FD[63:32]. If LASTH=0, the last data in the chain of FPGA logic
devices has not been shifted out yet. Thus, the Simulation system returns to state 1311 where it
waits one cycle for the FPGA logic device to put more address and control signals on FD[63:32].
This process continues until the last data has been shifted out of the FPGA logic devices. Note
that write operations (mem_wr_H=1) and read operations (mem_wr_H=0) can be interleaved or
otherwise alternate until LASTH=1.

When LASTH = 1, the MEMFSM proceeds to state 1309 where it waits while DONE=0. 
When DONE=1, both LASTL and LASTH are at logic 1 and thus, the Simulation write/read
cycle has completed. The Simulation system then proceeds to state 1300 where it remains idle
whenever DATAXSFR=0.

Alternatively, for both the high bank and the low bank, states 1309 and 1310 are not 
implemented in accordance with another embodiment of the present invention. Thus, in the low 
bank, the MEMFSM will proceed directly to state 1300 after passing states 1308 (LASTL=1) or 
1307 (MEM_WR_L=1 and LASTL=1). In the high bank, the MEMFSM will proceed directly
to state 1300 after passing states 1314 (LASTH=1) or 1313 (MEM_WR_H=1 and LASTH=1).

The state diagram of the EVALFSM unit 1248 will now be discussed in accordance with
one embodiment of the present invention. FIG. 59 shows such a state diagram of the
EVALFSMx finite state machine in each FPGA chip. Like FIG. 58, the state diagram in FIG. 59
has been structured so that two periods within the Simulation write/read cycle are also shown
with their corresponding states. Thus, states 1320-1326A correspond to the evaluation period,
and states 1326B-1336 correspond to the memory access period. Refer to FIG. 57 in conjunction
with FIG. 59 in the discussion below.

The EVALFSMx unit 1248 receives the EVAL signal on line 1274 from the 
CTRL_FPGA unit 1200 (see FIG. 57). While EVAL=0, no evaluation of data by the FPGA
logic devices is occurring. Thus, at state 1320, the EVALFSMx is idle while EVAL=0. When
EVAL=1, EVALFSMx proceeds to state 1321.

States 1321, 1322, and 1323 relate to inter-FPGA communication where data is evaluated
by the user's design via the FPGA logic devices. Here, EVALFSMx generates the signals
input_en, mux_en, and clk_en (item 1281 in FIG. 57) to the user's logic. At state 1321,
EVALFSMx generates the clk_en signal, which enables the second flip-flop of all the clock edge
register flip-flops in the user's design logic in this cycle (see FIG. 19). The clk_en signal is
otherwise known as the software clock. If the user's memory type is synchronous, clk_en also
enables the second clock of the memory read data double buffer 1251 in each memory block.
The SRAM data output for each memory block is sent to the user's design logic in this cycle.

At state 1322, the EVALFSMx generates the input_en signal to the user's design logic to
latch the input signals sent from the CPU by DMA transfer to the user's logic. The input_en
signal provides the enable input to the second flip-flop in the primary clock register (see FIG.
19).

At state 1323, EVALFSMx generates the mux_en signal to turn on the multiplexing
circuit in each FPGA logic device to start the communication with other FPGA logic devices in
the array. As explained earlier, inter-FPGA wire lines are often multiplexed to efficiently utilize
the limited pin resources in each FPGA logic device chip.

At state 1324, EVALFSMx waits for as long as EVAL=1. When EVAL=0, the
evaluation period has completed and so, state 1325 requires that EVALFSMx turn off the mux_en
signal.

If the number of memory blocks M (where M is an integer, including 0) is zero, the
EVALFSMx returns to state 1320, where it remains idle if EVAL=0. In most cases, M>0 and
thus, EVALFSMx proceeds to state 1326A/1326B. "M" is the number of memory blocks in the
FPGA logic device. It is a constant from the user's design mapped and configured in the FPGA
logic device; it does not count down. If M > 0, the right portion (memory access period) of FIG.
59 will be configured in the FPGA logic devices. If M=0, only the left portion (EVAL period)
of FIG. 59 will be configured.

State 1327 keeps the EVALFSMx in a wait state as long as SHIFTIN=0. When
SHIFTIN=1, the previous FPGA logic device has completed its memory access and the current
FPGA logic device is now ready to perform its memory access tasks. Alternatively, when
SHIFTIN=1, the current FPGA logic device is the first logic device in the bank and the
SHIFTIN input line is coupled to Vcc. Regardless, the receipt of the SHIFTIN=1 signal
indicates that the current FPGA logic device is ready to perform memory access. At state 1328,
the memory block number N is set at N = 1. This number N will be incremented at the
occurrence of each loop so that memory access for that particular memory block N can be
accomplished. Initially, N = 1 and so, EVALFSMx will proceed to access memory for memory
block 1.

At state 1329, EVALFSMx generates the select signal on line 1285 and the output_en
signal on line 1284 to the FPGA bus driver FDO_MUXx 1249 to put the address and control
signals of the Mem_Block_N interface 1253 onto the FPGA bus FD[63:32] or FD[31:0]. If a
write operation is required, wr=1; otherwise, a read operation is required so wr=0. The
EVALFSMx receives as one of its inputs the wr signal on line 1287. Based on this wr signal, the
proper select signal on line 1285 will be asserted.

When wr= 1, the EVALFSMx proceeds to state 1330. EVALFSMx generates the select 
and output_en signals for the FD bus driver to put the write data of the Mem_Block_N 1253 on 
the FPGA bus FD[63:32] or FD[31:0]. Thereafter, EVALFSMx waits one cycle to let the 
SRAM memory device complete the write cycle. EVALFSMx then goes to state 1335 where
the memory block number N is incremented by one; that is, N = N + 1.

However, if wr=0 at state 1329, a read operation is requested and EVALFSMx goes to
state 1332 where it waits one cycle and then to state 1333 where it waits another cycle. At state
1334, EVALFSMx generates the rd_latch signal on line 1286 to let the memory read data double
buffer 1251 of memory block N fetch the SRAM data out onto the FD bus. EVALFSMx then
proceeds to state 1335, where the memory block number N is incremented by one; that is,
N = N + 1. Thus, if N = 1 prior to the incrementing state 1335, N is now 2 so that subsequent
memory accesses will be applicable for memory block 2.

If the number of the current memory block N is less than or equal to the total number of
memory blocks M in the user's design (i.e., N ≤ M), the EVALFSMx proceeds to state 1329,
where it generates the particular select and output_en signals for the FD bus driver based on
whether the operation is a write or a read. Then, the write or read operation for this next
memory block N will take place.

If, however, the number of the current memory block N is greater than the total number
of memory blocks M in the user's design (i.e., N > M), the EVALFSMx proceeds to state 1336,
where it turns on the SHIFTOUT output signal to allow the next FPGA logic device in the bank
to access the SRAM memory devices. Thereafter, EVALFSMx proceeds to state 1320 where it is
idle until the Simulation system requires data evaluation among the FPGA logic devices (i.e.,
EVAL=1).
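
The per-block loop of the EVALFSMx memory access period can be traced with a short model. This is a sketch under assumptions: the function name evalfsm_memory_access is hypothetical, only states numbered in the text are recorded (the unnumbered wait cycle after state 1330 is omitted), and SHIFTIN=1 is taken as already received.

```python
def evalfsm_memory_access(wr_flags):
    """Return the sequence of EVALFSMx states visited for M memory blocks.

    wr_flags[n - 1] is the wr signal for memory block n (1 = write,
    0 = read), so M equals len(wr_flags).
    """
    m = len(wr_flags)
    if m == 0:
        # With M = 0 the memory access portion is not configured at all.
        return []
    trace = [1328]          # N is set to 1
    n = 1
    while n <= m:           # loop while N <= M
        trace.append(1329)  # drive address/control for block N
        if wr_flags[n - 1]:
            trace.append(1330)                # drive write data
        else:
            trace.extend([1332, 1333, 1334])  # two waits, then rd_latch
        trace.append(1335)  # N = N + 1
        n += 1
    trace.append(1336)      # N > M: assert SHIFTOUT
    return trace


# One write block followed by one read block.
states = evalfsm_memory_access([1, 0])
```

The trace ends at state 1336, after which the text has EVALFSMx return to idle state 1320.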

FIG. 61 shows the Simulation write/read cycle in accordance with one embodiment of the
present invention. FIG. 61 shows at reference numeral 1366 the three periods in the Simulation
write/read cycle: DMA data transfer period, evaluation period, and memory access period.
Although not shown, it is implicit that a prior DMA transfer, evaluation, and memory access may
have taken place. Furthermore, the timing for data transfers to/from the low bank SRAM may
differ from that of the high bank SRAM. For simplicity, FIG. 61 shows one example where the
access times for the low and high banks are identical. A global clock GCLK 1350 provides the
clocking signal for all components in the system.

The DATAXSFR signal 1351 indicates the occurrence of the DMA data transfer period. 
When DATAXSFR=1 at trace 1367, DMA data transfer is taking place between the main
computing system and the FPGA logic devices or SRAM memory devices. Thus, data is
provided on the FPGA high bank bus FD[63:32] 1359 and trace 1369, as well as the FPGA low
bank bus FD[31:0] 1358 and trace 1368. The DONE signal 1364 indicates the completion of the
memory access period by a logic 0 to 1 signal (e.g., trace 1390) or otherwise indicates the
duration of the Simulation write/read cycle with a logic 0 (e.g., combination of edge of trace
1370 and edge of trace 1390). During the DMA transfer period, the DONE signal is at logic 0.

At the end of the DMA transfer period, the DATAXSFR signal goes from logic 1 to 0, 
which triggers the onset of the evaluation period. Thus, EVAL 1352 is at logic 1 as indicated by 
trace 1371. The duration of the EVAL signal at logic 1 is predetermined and can be
programmable. During this evaluation period, the data in the user's design logic is evaluated
with the clk_en signal 1353 which is at logic 1 as indicated by trace 1372, the input_en signal
1354 which is also at logic 1 as indicated by trace 1373, and the mux_en signal 1355 which is
also at logic 1 for a longer duration than clk_en and input_en as indicated by trace 1374. Data is
being evaluated within this particular FPGA logic device. When the mux_en signal 1355 goes
from logic 1 to 0 at trace 1374 and at least one memory block is present in the FPGA logic
devices, then the evaluation period ends and the memory access period begins.

The SHIFTIN signal 1356 is asserted with a logic 1 at trace 1375. This indicates that the 
preceding FPGA has completed its evaluations and all desired data have been accessed to/from 
this preceding FPGA logic device. Now, the next FPGA logic device in the bank is ready to
begin memory accesses.

In traces 1377 to 1386, the following nomenclature will be used. ACj_k indicates that the
address and control signal is associated with FPGAj and memory block k, where j and k are
integers including 0. WDj_k indicates write data for FPGAj and memory block k. RDj_k
indicates read data for FPGAj and memory block k. Thus, AC3_1 indicates the address and
control signals associated with FPGA3 and memory block 1. The low bank SRAM accesses and
the high bank SRAM accesses 1361 are shown as trace 1387.
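
As an aid to reading the trace labels, the nomenclature above can be decoded mechanically. The following is an illustrative sketch only; the function name parse_label and the descriptive strings it returns are hypothetical.

```python
import re

# Label kinds as defined in the text: ACj_k, WDj_k, RDj_k.
_KINDS = {"AC": "address/control", "WD": "write data", "RD": "read data"}
_LABEL = re.compile(r"^(AC|WD|RD)(\d+)_(\d+)$")


def parse_label(label):
    """Split a trace label into (kind, FPGA index j, memory block k)."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"not a recognized trace label: {label!r}")
    kind, j, k = match.groups()
    return _KINDS[kind], int(j), int(k)
```

For example, parse_label("AC3_1") yields the address/control kind with FPGA index 3 and memory block 1, matching the example given in the text.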

The next few traces 1377 to 1387 will show how memory access is accomplished. Based
on the logic level of wrx signal to the EVALFSMx and consequently, the mem_wr signal to the
MEMFSM, either a write or read operation is performed. If a write operation is desired, the
memory model interfaces with the user's memory block N interface (Mem_Block_N interface
1253 in FIG. 57) to provide wrx as one of its control signals. This control signal wrx is provided
to the FD bus driver as well as the EVALFSMx unit. If wrx is at logic 1, the proper select
signal and output_en signal are provided to the FD bus driver to place the memory write data on
the FD bus. This same control signal which is now on the FD bus can be latched by the memory
address/control latch in the CTRL_FPGA unit. The memory address/control latch sends the
address and control signals to the SRAM via a MA[18:2]/control bus. The wrx control signal,
which is at logic 1, is extracted from the FD bus and because a write operation is requested, the
data associated with the address and control signals on the FD bus is sent to the SRAM memory
device.

Thus, as shown in FIG. 61, this next FPGA logic device, which is logic device FPGA0 in
the low bank, places AC0_0 on FD[31:0] as indicated by trace 1377. The Simulation system
performs a write operation for WD0_0. Then, AC0_1 is placed on the FD[31:0] bus. If,
however, a read operation was requested, the placement of the AC0_1 on the FD bus FD[31:0]
would be followed by some time delay before RD0_0 instead of WD0_0 corresponding to AC0_0
is placed on the FD bus by the SRAM memory device.

Note that placement of the AC0_0 on the MA[18:2]/control bus as indicated by trace 1383
is slightly delayed relative to the placement of the address, control, and data on the FD bus. This is
because the MEMFSM unit requires time to latch the address/control signals in from the FD bus,
extract the mem_wr signal, and generate the proper select signal to the address/control mux so
that address/control signals can be placed on the MA[18:2]/control bus. Furthermore, after
placement of the address/control signals on the MA[18:2]/control bus to the SRAM memory
device, the Simulation system must wait for the corresponding data from the SRAM memory
device to be placed on the FD bus. One example is the time offset between trace 1384 and trace
1381, where the RD1_1 is placed on the FD bus after the AC1_1 is placed on the
MA[18:2]/control bus.

On the high bank, FPGA1 is placing AC1_0 on the bus FD[63:32], which is then
followed by WD1_0. Thereafter, AC1_1 is placed on the bus FD[63:32]. This is indicated by
trace 1380. When AC1_1 is placed on the FD bus, the control signal indicates a read operation
in this example. Thus, as described above, the proper wrx and mem_wr signals at logic 0 are
presented in the address/control signals to the EVALFSMx and MEMFSM units as AC1_1 is
placed on the MA[18:2]/control bus as indicated by trace 1384. Because the Simulation system
knows that this is a read operation, write data will not be transported to the SRAM memory
device; rather, read data associated with AC1_1 is placed on the FD bus by the SRAM memory
device for subsequent reading by the user's design logic via the Simulation memory block
interface. This is indicated by trace 1381 on the high bank. On the low bank, RD0_1 is placed
on the FD bus as indicated by trace 1378, following the AC0_1 on the MA[18:2]/control bus (not
shown).

The reading operation by the user's design logic via the Simulation memory block
interface is accomplished when the EVALFSMx generates the rd_lat0 signal 1362 to the memory
read data double buffer in the Simulation memory block interface as indicated by trace 1388.
This rd_lat0 signal is provided to both the low bank FPGA0 and the high bank FPGA1.

Thereafter, the next memory block for each FPGA logic device is placed on the FD bus.
AC2_0 is placed on the low bank FD bus, while AC3_0 is placed on the high bank FD bus. If a
write operation is desired, WD2_0 is placed on the low bank FD bus and WD3_0 is placed on the
high bank FD bus. AC3_0 is placed on the high bank MA[18:2]/control bus as indicated on trace
1385. This process continues for the next memory block for write and read operations. Note
that the write and read operations for the low bank and the high bank can occur at differing times
and speeds and FIG. 61 shows one particular example where the timing for the low and high
banks are the same. Additionally, write operations for the low and high banks occur together,
followed by read operations on both banks. This may not always be the case. The existence of
low and high banks allows parallel operation of the devices coupled to these banks; that is,
activity on the low bank is independent of activity on the high bank. Other scenarios can be
envisioned where the low bank is performing a series of write operations while the high bank is
performing a series of read operations in parallel.

When the last data in the last FPGA logic device for each bank is encountered, the
SHIFTOUT signal 1357 is asserted as indicated by trace 1376. For read operations, a rd_lat1
signal 1363 corresponding to FPGA2 on the low bank and FPGA3 on the high bank is asserted as
indicated by trace 1389 to read RD2_1 on trace 1379 and RD3_1 on trace 1382. Because the last
data for the last FPGA units have been accessed, the completion of the Simulation write/read
cycle is indicated by the DONE signal 1364 as indicated by trace 1390.

The following Table H lists and describes the various components on the Simulation
system boards and their corresponding register/memory, PCI memory address, and local address.



TABLE H: MEMORY MAP 

Component | Register/Memory | PCI Memory Address | Local Address | Description 
--------------------------------------------------------------------------------
PLX9080 | PCI Configuration Registers | 00h to 3Ch | | 
PLX9080 | Local Config./Runtime/DMA Registers | Offset from PCI base addr 0: 0 - FFh | Offset from CS addr: 80h - 180h | Accessible from PCI and Local buses 
CTRL_FPGA[6:1] | XSFR_EVAL Register | Offset from PCI base addr 2: 0h | 0h | in Local Space 0 
CTRL_FPGA1 | CONFIG_JTAG1 Register | Offset from PCI base addr 2: 10h | 10h | in Local Space 0 
CTRL_FPGA2 | CONFIG_JTAG2 Register | Offset from PCI base addr 2: 14h | 14h | in Local Space 0 
CTRL_FPGA3 | CONFIG_JTAG3 Register | Offset from PCI base addr 2: 18h | 18h | in Local Space 0 
CTRL_FPGA4 | CONFIG_JTAG4 Register | Offset from PCI base addr 2: 1Ch | 1Ch | in Local Space 0 
CTRL_FPGA5 | CONFIG_JTAG5 Register | Offset from PCI base addr 2: 18h | 20h | in Local Space 0 
CTRL_FPGA6 | CONFIG_JTAG6 Register | Offset from PCI base addr 2: 1Ch | 24h | in Local Space 0 
CTRL_FPGA1 | Local RAM | Offset from PCI base addr 2: 400h - 7FFh | 400h - 7FFh | in Local Space 0 
FPGA[3:0] | SPACE0 | Offset from PCI base addr for ch0 DMA: 0 - FFF FFFFh | 8000 0000h to 8FFF FFFFh | DMA write transfer for GLOBAL and S2H data 
FPGA[3:0] | SPACE1 | Offset from PCI base addr for ch0 DMA: 0 - FFF FFFFh | 9000 0000h to 9FFF FFFFh | DMA write transfer for REGISTER_WRITE data 
FPGA[3:0] | SPACE2 | Offset from PCI base addr for ch1 DMA: 0 - FFF FFFFh | A000 0000h to AFFF FFFFh | DMA read transfer for H2S data 
FPGA[3:0] | SPACE3 | Offset from PCI base addr for ch1 DMA: 0 - FFF FFFFh | B000 0000h to BFFF FFFFh | DMA read transfer for REGISTER_READ data 
L_SRAM, H_SRAM | SPACE4 | Offset from PCI base addr for ch0 DMA: 0 - FFF FFFFh | C000 0000h to CFFF FFFFh | DMA write transfer for SRAM 
L_SRAM, H_SRAM | SPACE5 | Offset from PCI base addr for ch1 DMA: 0 - FFF FFFFh | D000 0000h to DFFF FFFFh | DMA read transfer for SRAM 
 | SPACE6 | Offset from PCI base addr for ch1 DMA: 0 - FFF FFFFh | E000 0000h to EFFF FFFFh | Reserved 
 | SPACE7 | Offset from PCI base addr for ch1 DMA: 0 - FFF FFFFh | F000 0000h to FFFF FFFFh | Reserved 
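
As an illustration of how the local address ranges in Table H partition into the eight DMA spaces, the following is a minimal sketch in Python. The function and constant names are hypothetical and are not part of the described system; only the base addresses and the purpose of each space are taken from the table.

```python
from typing import NamedTuple, Optional

# Local address ranges from Table H: SPACE0 starts at 8000 0000h and each
# space covers 1000 0000h addresses (up to FFFF FFFFh for SPACE7).
SPACE_BASE = 0x8000_0000
SPACE_SIZE = 0x1000_0000

SPACE_PURPOSE = {
    0: "DMA write: GLOBAL and S2H data",
    1: "DMA write: REGISTER_WRITE data",
    2: "DMA read: H2S data",
    3: "DMA read: REGISTER_READ data",
    4: "DMA write: SRAM",
    5: "DMA read: SRAM",
    6: "Reserved",
    7: "Reserved",
}


class DecodedAddress(NamedTuple):
    space: int      # SPACE number, 0 through 7
    offset: int     # offset of the address within that space
    purpose: str    # description column of Table H


def decode_local_address(addr: int) -> Optional[DecodedAddress]:
    """Map a 32-bit local address to its DMA space, or None if outside them."""
    if not SPACE_BASE <= addr <= 0xFFFF_FFFF:
        return None
    space, offset = divmod(addr - SPACE_BASE, SPACE_SIZE)
    return DecodedAddress(space, offset, SPACE_PURPOSE[space])
```

For example, under these assumptions the local address A000 0010h decodes to SPACE2 (DMA read transfer for H2S data) at offset 10h, while an address such as 400h falls in Local Space 0 and is not part of any DMA space.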



The data format for the configuration file is shown below in Table J in accordance with 
one embodiment of the present invention. The CPU sends one word through the PCI bus each 
time to configure one bit for all on-board FPGAs in parallel. 

TABLE J: CONFIGURATION DATA FORMAT 

      | bit0 | bit1 | bit2 | bit3 | bit16-31 
------------------------------------------------------------------
word0 | D0(FPGA0) | D0(FPGA1) | D0(FPGA2) | D0(FPGA3) | control/status 
word1 | D1(FPGA0) | D1(FPGA1) | D1(FPGA2) | D1(FPGA3) | control/status 
word2 | D2(FPGA0) | D2(FPGA1) | D2(FPGA2) | D2(FPGA3) | control/status 
word3 | D3(FPGA0) | D3(FPGA1) | D3(FPGA2) | D3(FPGA3) | control/status 
word4 | D4(FPGA0) | D4(FPGA1) | D4(FPGA2) | D4(FPGA3) | control/status 
word5 | D5(FPGA0) | D5(FPGA1) | D5(FPGA2) | D5(FPGA3) | control/status 
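
The word layout in Table J can be sketched as follows. This is a minimal illustration in Python, not the actual configuration software: the function name and the `control` parameter are hypothetical, and the sketch assumes that the configuration bit for FPGA i occupies bit i of each word, with the control/status field in bits 16-31.

```python
from typing import List, Sequence


def pack_config_words(bitstreams: Sequence[Sequence[int]],
                      control: int = 0) -> List[int]:
    """Interleave per-FPGA configuration bitstreams into 32-bit words.

    bitstreams[i][n] is configuration bit n for FPGA i. Word n carries
    bit n of every FPGA, so one PCI write configures one bit of all
    on-board FPGAs in parallel.
    """
    if not bitstreams:
        return []
    if len(bitstreams) > 16:
        raise ValueError("only bits 0-15 are available for config data")
    if not 0 <= control <= 0xFFFF:
        raise ValueError("control/status field is 16 bits wide")
    length = len(bitstreams[0])
    if any(len(stream) != length for stream in bitstreams):
        raise ValueError("all bitstreams must have the same length")

    words = []
    for n in range(length):
        word = control << 16                      # bits 16-31: control/status
        for fpga, stream in enumerate(bitstreams):
            word |= (stream[n] & 1) << fpga       # assumed: bit i = FPGA i
        words.append(word)
    return words
```

With four two-bit streams `[[1, 0], [0, 1], [1, 1], [0, 0]]`, the sketch produces two words, one per configuration bit position.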



The following Table K lists the XSFR_EVAL register. It resides in all the boards. The 
XSFR_EVAL register is used by the host computing system to program the EVAL period, 
control DMA read/write, and read the status of the EVAL_DONE and XSFR_DONE fields. The 
host computing system also uses this register to enable memory access. The operation of the 
Simulation system with respect to this register is described below in conjunction with FIGS. 
62 and 63. 



TABLE K: XSFR_EVAL REGISTER for all 6 boards (Local Addr: 0h) 

Bit | Signal | Description | R/W | Value After Reset 
--------------------------------------------------------------------------------
7:0 | EVALTIME[7:0] | Eval time in cycles of PCI clock | R/W | 0h 
8 | EVAL_DONE | Eval_done flag. Cleared by setting WR_XSFR bit. | R | 0 
9 | XSFR_DONE | Xsfr_done flag for both read and write. Cleared by writing XSFR_EVAL register. | R | 0 
10 | RD_XSFR_EN | Enable DMA-read-transfer. Cleared by XSFR_DONE. | R/W | 0 
11 | WR_XSFR_EN | Enable DMA-write-transfer. Cleared by XSFR_DONE. When both WR_XSFR and RD_XSFR are set, CTRL_FPGA executes DMA-write-transfer first, then DMA-read-transfer automatically. | R/W | 0 
19:12 | | Reserved | R/W | 0h 
20 | FCLRN | Resets all FPGA[3:0] when low. | R/W | 0 
21 | WAIT_EVAL | This bit is effective if both RD_XSFR and WR_XSFR are set. When 1, DMA-read-transfer starts after EVAL_DONE. When 0, DMA-read-transfer starts after CLK_EN. | R/W | 0 
22 | MEM_EN | Enable on-board SRAM | R/W | 0 
31:23 | | Reserved | | 







The following Table L lists the contents of the CONFIG_JTAG[6:1] register. The CPU 
configures the FPGA logic devices and runs the boundary scan test for FPGA logic devices 
through this register. Each board has one dedicated register. 



TABLE L: CONFIG_JTAG[6:1] REGISTER 

Bit | Signal | Description | R/W | Value After Reset 
--------------------------------------------------------------------------------
15:0 | CONF_D[15:0] | Config data for FPGA[15:0] | R/W | 0h 
16 | NCONFIG | Start configuration at low-to-high transition. | R/W | 0 
17 | CONFDONE | Config done | R | 
18 | CONF_CLK | Config clock | R/W | 
19 | NSTATUS | Config status, error when low | R | 
20 | F_OE | Output enable to all on-board Simulation FPGAs | R/W | 
21 | JTAG_TCK | JTAG clock | R/W | 
22 | JTAG_TMS | JTAG mode select | R/W | 
23 | JTAG_TDI | JTAG data in - send to TDI of FPGA0 | R/W | 
24 | JTAG_TDO | JTAG data out - from TDO of FPGA3 | R | 
25 | JTAG_NR | Reset JTAG test when low. | R/W | 
26 | LED2 | 1 = turn on LED2 for Config_status. 0 = turn off. | R/W | 
27 | LED3 | 1 = turn on LED3 for DataXsfr/Diag. 0 = turn off. | R/W | 
31:28 | | Reserved | | 
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FIGS. 62 and 63 show timing diagrams of another embodiment of the present invention. 
These two figures show the operation of the Simulation system with respect to the XSFR_EVAL 
register. The XSFR_EVAL register is used by the host computing system to program the EVAL 
period, control DMA read/write, and read the status of the EVAL_DONE and XSFR_DONE 
fields. The host computing system also uses this register to enable memory access. One of the 
main differences between these two figures is the state of the WAIT_EVAL field. When the 
WAIT_EVAL field is set to "0," as is the case for FIG. 62, the DMA read transfer starts after 
CLK_EN. When the WAIT_EVAL field is set to "1," as is the case for FIG. 63, the DMA read 
transfer starts after EVAL_DONE. 
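
To make the field layout of the XSFR_EVAL register concrete, the following Python sketch packs and unpacks the fields at the bit positions listed in Table K. The class and method names are hypothetical; the sketch only models the register image and does not represent actual host software. EVAL_DONE and XSFR_DONE are read-only per Table K, so their values in an encoded word would be ignored by the hardware on a write.

```python
from dataclasses import dataclass

# Bit positions taken from Table K.
EVALTIME_MASK = 0xFF      # bits 7:0
EVAL_DONE_BIT = 8         # read-only
XSFR_DONE_BIT = 9         # read-only
RD_XSFR_EN_BIT = 10
WR_XSFR_EN_BIT = 11
FCLRN_BIT = 20
WAIT_EVAL_BIT = 21
MEM_EN_BIT = 22


@dataclass
class XsfrEval:
    """Hypothetical host-side image of the XSFR_EVAL register."""
    evaltime: int = 0
    eval_done: bool = False
    xsfr_done: bool = False
    rd_xsfr_en: bool = False
    wr_xsfr_en: bool = False
    fclrn: bool = False
    wait_eval: bool = False
    mem_en: bool = False

    def encode(self) -> int:
        """Pack the fields into a 32-bit register word."""
        if not 0 <= self.evaltime <= EVALTIME_MASK:
            raise ValueError("EVALTIME is an 8-bit field")
        word = self.evaltime
        word |= int(self.eval_done) << EVAL_DONE_BIT
        word |= int(self.xsfr_done) << XSFR_DONE_BIT
        word |= int(self.rd_xsfr_en) << RD_XSFR_EN_BIT
        word |= int(self.wr_xsfr_en) << WR_XSFR_EN_BIT
        word |= int(self.fclrn) << FCLRN_BIT
        word |= int(self.wait_eval) << WAIT_EVAL_BIT
        word |= int(self.mem_en) << MEM_EN_BIT
        return word

    @classmethod
    def decode(cls, word: int) -> "XsfrEval":
        """Unpack a 32-bit register word; reserved bits are ignored."""
        def bit(n: int) -> bool:
            return bool((word >> n) & 1)

        return cls(
            evaltime=word & EVALTIME_MASK,
            eval_done=bit(EVAL_DONE_BIT),
            xsfr_done=bit(XSFR_DONE_BIT),
            rd_xsfr_en=bit(RD_XSFR_EN_BIT),
            wr_xsfr_en=bit(WR_XSFR_EN_BIT),
            fclrn=bit(FCLRN_BIT),
            wait_eval=bit(WAIT_EVAL_BIT),
            mem_en=bit(MEM_EN_BIT),
        )
```

A configuration like that of FIG. 63 (both transfer enables set, WAIT_EVAL set) would be written as `XsfrEval(evaltime=0x20, rd_xsfr_en=True, wr_xsfr_en=True, wait_eval=True).encode()`, where the EVALTIME value of 20h is an arbitrary illustrative choice.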

In FIG. 62, both WR_XSFR_EN and RD_XSFR_EN are set to "1." These two fields 
enable DMA write/read transfers and can be cleared by XSFR_DONE. Because both fields are 
set to "1," the CTRL_FPGA unit automatically executes DMA write transfer first and then DMA 
read transfer. The WAIT_EVAL field, however, is set to "0," indicating that the DMA read 
transfer starts after the assertion of CLK_EN (and after the completion of the DMA write 
operation). Thus, in FIG. 62, the DMA read operation occurs almost immediately after the 
completion of the DMA write operation, as soon as the CLK_EN signal (software clock) is 
detected. The DMA read transfer operation does not wait for the completion of the EVAL 
period. 

At the beginning of the timing diagram, EVAL_REQ_N signals experience contention as 
multiple FPGA logic devices vie for attention. As explained previously, the EVAL_REQ_N (or 
EVAL_REQ#) signal is used to start the evaluation cycle if any of the FPGA logic devices asserts 
this signal. At the end of the data transfer, the evaluation cycle begins, including address pointer 
initialization and the operation of the software clocks to facilitate the evaluation process. 

The DONE signal, which is generated at the conclusion of a DMA data transfer period, 
also experiences contention as multiple LAST signals (from the shiftin and shiftout signals at the 
output of each FPGA logic device) are generated and provided to the CTRL_FPGA unit. When 
all the LAST signals are received and processed, the DONE signal is generated and a new DMA 
data transfer operation can begin. The EVAL_REQ_N signal and the DONE signal use the same 
wire on a time-shared basis in a manner to be discussed below. 
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The system automatically initiates DMA write transfer first as is shown by the WR_XSFR 
signal at time 1409. The initial portion of the WR_XSFR signal includes some overhead 
associated with the PCI controller, the PCI9080 or 9060 in one embodiment. Thereafter, the host 
computing system performs a DMA write operation via the local bus LD[31:0] and the FPGA bus 
FD[63:0] to the FPGA logic devices coupled to the FPGA bus FD[63:0]. 

At time 1412, the WR_XSFR signal is deactivated, indicating the completion of the DMA 
write operation. The EVAL signal is activated for a predetermined time from time 1412 to time 
1410. The duration of the EVALTIME is programmable and initially set at 8+X, where X is 
derived from the longest signal trace path. The XSFR_DONE signal is also activated for a brief 
time to indicate the completion of this DMA transfer operation, which in this case is a 
DMA write. 

Also at time 1412, the contention among EVAL_REQ_N signals ceases but the wire that 
carries the DONE signal now delivers the EVAL_REQ_N signal to the CTRL_FPGA unit. For 3 
clock cycles, the EVAL_REQ_N signals are processed via the wire that carries the DONE signal. 
After 3 clock cycles, the EVAL_REQ_N signals are no longer generated by the FPGA logic 
devices but the EVAL_REQ_N signals that have previously been delivered to the CTRL_FPGA 
unit will be processed. The maximum time during which the EVAL_REQ_N signals are no longer 
generated by the FPGA logic devices for gated clocks is roughly 23 clock cycles. 
EVAL_REQ_N signals longer than this period will be ignored. 

At time 1413, approximately 2 clock cycles after time 1412 (which is at the end of the 
DMA write operation), the CTRL_FPGA unit sends a write address strobe WPLX_ADS_N signal 
to the PCI controller (e.g., PLX PCI9080) to initiate the DMA read transfer. In about 24 clock 
cycles from time 1413, the PCI controller will start the DMA read transfer process and the 
DONE signal is also generated. At time 1414, prior to the start of the DMA read process by the 
PCI controller, the RD_XSFR signal is activated to enable the DMA read transfer. Some PLX 
overhead data is transmitted and processed first. At time 1415, during the time that this overhead 
data is processed, the DMA read data is placed on the FPGA bus FD[63:0] and the local bus 
LD[31:0]. At the end of the 24 clock cycles from time 1413 and at the time of the activation of 
the DONE signal and the generation of the EVAL_REQ_N signals from the FPGA logic devices, 
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the PCI controller processes the DMA read data by transporting the data from the FPGA bus 
FD[63:0] and the local bus LD[31:0] to the host computer system. 

At time 1410, the DMA read data will continue to be processed while the EVAL signal 
will be deactivated and the EVAL_DONE signal will be activated to indicate the completion of 
the EVAL cycle. Contention among the FPGA logic devices also begins as they generate the 
EVAL_REQ_N signals. 

At time 1417, just prior to the completion of the DMA read period at time 1416, the host 
computer system polls the PLX interrupt register to determine if the end of the DMA cycle is 
near. The PCI controller knows how many cycles are necessary to complete the DMA data 
transfer process. After a predetermined number of cycles, the PCI controller will set a particular 
bit in its interrupt register. The CPU in the host computer system polls this interrupt register in 
the PCI controller. If the bit is set, the CPU knows that the DMA period is almost done. The 
CPU in the host system does not poll the interrupt register all the time because doing so would tie up 
the PCI bus with read cycles. Thus, in one embodiment of the present invention, the CPU in the 
host computer system is programmed to wait a certain number of cycles before it polls the 
interrupt register. 
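
The delayed polling strategy described above can be sketched as follows. This is a minimal illustration in Python rather than the host driver itself: the function name, its callbacks, and the poll interval and timeout parameters are all hypothetical. The source states only that the CPU waits a certain number of cycles before polling; the fixed interval between subsequent polls is an added assumption.

```python
from typing import Callable


def wait_for_dma_near_done(read_interrupt_bit: Callable[[], bool],
                           idle: Callable[[int], None],
                           initial_wait_cycles: int,
                           poll_interval_cycles: int,
                           max_polls: int) -> int:
    """Wait first, then poll, so the PCI bus is not tied up by read cycles.

    read_interrupt_bit() stands in for one PCI read of the controller's
    interrupt register; idle(n) stands in for waiting n cycles without
    touching the bus. Returns the number of polls that were issued.
    """
    idle(initial_wait_cycles)              # no bus traffic during this wait
    for polls in range(1, max_polls + 1):
        if read_interrupt_bit():           # one PCI read cycle per poll
            return polls
        idle(poll_interval_cycles)         # assumed back-off between polls
    raise TimeoutError("interrupt bit was not set within max_polls reads")
```

Choosing the initial wait close to the expected DMA length keeps the number of register reads small, which is the point of the approach described in the text.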

After a brief time, the end of the DMA read period occurs at time 1416 as the RD_XSFR 
is deactivated and the DMA read data is no longer on the FPGA bus FD[63:0] or the local bus 
LD[31:0]. The XSFR_DONE signal is also activated at time 1416 and contention among the 
LAST signals for generation of the DONE signal begins. 

During the entire DMA period from the generation of the WR_XSFR signal at time 1409 
to time 1417, the CPU in the host computer system does not access the Simulation hardware 
system. In one embodiment, the duration of this period is the sum of (1) overhead time for the 
PCI controller times 2, (2) the number of words of WR_XSFR and RD_XSFR, and (3) the host 
computer system's (e.g., Sun UltraSPARC) PCI overhead. The first access after the DMA 
period occurs at time 1419 when the CPU polls the interrupt register in the PCI controller. 
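
The three-term sum above can be written out as a small helper. This Python sketch is illustrative only: the function and parameter names are hypothetical, and it assumes that every term is expressed in clock cycles and that each transferred word costs one cycle, neither of which the text states explicitly.

```python
def dma_period_cycles(pci_controller_overhead: int,
                      wr_xsfr_words: int,
                      rd_xsfr_words: int,
                      host_pci_overhead: int) -> int:
    """Estimate the DMA period during which the CPU leaves the hardware alone.

    Terms follow the text: (1) PCI controller overhead counted twice (once
    for the write transfer, once for the read transfer), (2) the number of
    WR_XSFR and RD_XSFR words, and (3) the host system's PCI overhead.
    Assumption: all terms are in clock cycles, one cycle per word.
    """
    return (2 * pci_controller_overhead
            + wr_xsfr_words + rd_xsfr_words
            + host_pci_overhead)
```

An estimate of this kind is one plausible way to choose how long the CPU waits before it first polls the interrupt register.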

At time 1411, which is about 3 clock cycles after time 1416, the MEM_EN signal is 
activated to enable the on-board SRAM memory devices so that memory access between the 
FPGA logic devices and the SRAM memory devices can begin. Memory access continues until 
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time 1419 and in one embodiment, 5 clock cycles are necessary per access. If no DMA read 
transfer is necessary, then the memory access can begin earlier at time 1410 instead of time 1411. 

While the memory access takes place between the FPGA logic devices and the SRAM 
memory devices across the FPGA bus FD[63:0], the CPU in the host computer system can 
communicate with the PCI controller and the CTRL_FPGA unit via the local bus LD[31:0] from 
time 1418 to time 1429. This occurs after the CPU has completed polling the interrupt register 
of the PCI controller. The CPU writes data onto various registers in preparation for the next data 
transfer. The duration of this period is greater than 4 μsec. If the memory access is shorter than 
this period, the FPGA bus FD[63:0] will not experience any conflicts. At time 1429, the 
XSFR_DONE signal is deactivated. 

In FIG. 63, the timing diagram is somewhat different from that of FIG. 62 because in 
FIG. 63 the WAIT_EVAL field is set to "1." In other words, the DMA read transfer period 
starts after the EVAL_DONE signal has been activated and is almost completed. It waits for the 
near completion of the EVAL period instead of starting immediately after the completion of the 
DMA write operation. The EVAL signal is activated for a predetermined time from time 1412 to 
time 1410. At time 1410, the EVAL_DONE signal is activated to indicate the completion of the 
EVAL period. 

In FIG. 63, after the DMA write operation at time 1412, the CTRL_FPGA unit does not 
generate the write address strobe signal WPLX_ADS_N to the PCI controller until time 1420, 
which is about 16 clock cycles before the end of the EVAL period. The XSFR_DONE signal is 
also extended to time 1423. At time 1423, the XSFR_DONE field is set and the WPLX_ADS_N 
signal can then be generated to start the DMA read process. 

At time 1420, approximately 16 clock cycles before the activation of the EVAL_DONE 
signal, the CTRL_FPGA unit sends a write address strobe WPLX_ADS_N signal to the PCI 
controller (e.g., PLX PCI9080) to initiate the DMA read transfer. In about 24 clock cycles from 
time 1420, the PCI controller will start the DMA read transfer process and the DONE signal is 
also generated. At time 1421, prior to the start of the DMA read process by the PCI controller, 
the RD_XSFR signal is activated to enable the DMA read transfer. Some PLX overhead data is 
transmitted and processed first. At time 1422, during the time that this overhead data is 
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processed, the DMA read data is placed on the FPGA bus FD[63:0] and the local bus LD[31:0]. 
At the end of the 24 clock cycles at time 1424, the PCI controller processes the DMA read data 
by transporting the data from the FPGA bus FD[63:0] and the local bus LD[31:0] to the host 
computer system. The remainder of the timing diagram is equivalent to that of FIG. 62. 

Thus, the RD_XSFR signal in FIG. 63 is activated later than in FIG. 62. The RD_XSFR 
signal in FIG. 63 follows the near completion of the EVAL period so that the DMA read 
operation is delayed. The RD_XSFR signal in FIG. 62 follows the detection of the CLK_EN 
signal after the completion of the DMA write transfer. 
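
The difference between FIGS. 62 and 63 comes down to which event gates the DMA read transfer. The following Python sketch summarizes that decision as an ordered list of phases. The function and phase names are hypothetical. The sketch simplifies the timing: in FIG. 63 the address strobe is actually issued about 16 clock cycles before EVAL_DONE. The behavior when only one of the enables is set is an assumption, based on Table K's statement that WAIT_EVAL is effective only if both enables are set.

```python
from typing import List


def transfer_sequence(wr_xsfr_en: bool, rd_xsfr_en: bool,
                      wait_eval: bool) -> List[str]:
    """Return the ordered phases the CTRL_FPGA unit would step through."""
    phases: List[str] = []
    if wr_xsfr_en:
        phases.append("DMA_WRITE")            # write always goes first
    if rd_xsfr_en:
        if wr_xsfr_en:
            # WAIT_EVAL only takes effect when both enables are set.
            phases.append("WAIT_EVAL_DONE" if wait_eval else "WAIT_CLK_EN")
        phases.append("DMA_READ")
    return phases
```

Under these assumptions, FIG. 62 corresponds to `transfer_sequence(True, True, False)` and FIG. 63 to `transfer_sequence(True, True, True)`.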

In the above embodiment, the verification system mapped memory blocks that were in the 
FPGA chips into the on-board SRAMs on the FD bus. Referring to FIG. 56, for example, 
memory block A in FPGA chip 1203 and memory block B in FPGA chip 1201 are mapped into 
SRAMs 1205 and 1206, respectively. In accordance with another embodiment of the present 
invention, the verification system can map memory blocks into any memory device or storage 
that the computer system can access. This includes main memory, PCI expansion memory, 
DRAM, SRAM, ROM, and the like. For example, referring now to FIGS. 46 and 56, assume 
that memory block A is in FPGA chip 1203, memory block B is in FPGA chip 1201, and memory 
blocks C and D are in FPGA chip 1202. 

Accordingly, to use the above example, one embodiment of the present invention can map 
these memory blocks from the FPGA chips into the SRAMs, as well as RAM 15 and memory in 
PCI device 54 (see FIG. 46). Thus, memory block A is mapped into SRAM 1205, memory 
block B is mapped into SRAM 1206, memory block C is mapped into main memory 15 (see FIG. 
46), and memory block D is mapped into memory in PCI device 54 (see FIG. 46). Usually, this 
scheme is employed when the capacities of the SRAMs 1205 and 1206 are too small. 
Alternatively, this scheme is employed when the memory block that needs to be mapped is larger 
than the on-board SRAM, or the memory block is shared by other software models and test 
benches. Mapping these memory blocks is important because the CPU needs to dump and 
manipulate memory data very often during simulation. 
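
The example mapping of blocks A through D can be captured in a small lookup table. This Python sketch is illustrative only: the type, table, and function names are hypothetical, and the integer fields simply reuse the reference numerals from FIGS. 46 and 56.

```python
from dataclasses import dataclass
from enum import Enum


class Backing(Enum):
    ONBOARD_SRAM = "on-board SRAM"
    MAIN_MEMORY = "main memory"
    PCI_DEVICE_MEMORY = "PCI device memory"


@dataclass(frozen=True)
class BlockMapping:
    fpga_chip: int      # reference numeral of the FPGA chip holding the block
    backing: Backing    # kind of storage the block is mapped into
    device: int         # reference numeral of the backing device


# Mapping taken from the example in the text.
MEMORY_MAP = {
    "A": BlockMapping(1203, Backing.ONBOARD_SRAM, 1205),
    "B": BlockMapping(1201, Backing.ONBOARD_SRAM, 1206),
    "C": BlockMapping(1202, Backing.MAIN_MEMORY, 15),
    "D": BlockMapping(1202, Backing.PCI_DEVICE_MEMORY, 54),
}


def needs_cpu_access(block: str) -> bool:
    """True when the block lives outside the on-board SRAMs, so the CPU
    (rather than the CTRL_FPGA unit) performs the memory access."""
    return MEMORY_MAP[block].backing is not Backing.ONBOARD_SRAM
```

In this sketch, blocks C and D would be serviced by the CPU, while blocks A and B continue to use the on-board SRAM path.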

In order to accomplish this memory mapping into external memory, the CPU performs the 
equivalent memory access function of memory control blocks such as CTRL_FPGA 1200 (see FIG. 56 
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and associated discussion) and the evaluation logic in the logic device which contains the memory 
blocks. The equivalent connections between memory blocks and the external memory devices are 
also provided. 

Implementing this system is analogous to the embodiment above. For the bus driver of 
the external memory, the first mux input (see mux 1249 in FIG. 57) is connected to the user 
memory interface and memory write data (DMA RD space 2). In the SRAM memory mapping 
embodiment (see FIG. 57), the third mux input is connected to the user memory interface and the 
fourth mux input is connected to the memory write data. 

For the memory block interface, the same memory converter from the previous 
embodiment (see memory model 1252 in FIG. 57) is used. The external memory read data are 
sent to hardware by DMA WR space 0. In the previous embodiment, the memory block interface 
includes the memory converter and the double buffer (for the memory read data). 

For the evaluation logic, the signals of shiftin and shiftout for on-board SRAM access are 
not used. In the previous on-board SRAM memory mapping embodiment, the signals of eval, 
shiftin, and shiftout are used. 

For memory initialization and dumping, the previous on-board SRAM embodiment used 
DMA space 4 and 5 through the CTRL_FPGA 1200 unit. In the external memory embodiment, 
memory access is by the CPU. 

For memory access during simulation, the previous on-board SRAM embodiment located 
memory blocks in the FPGA chips which sent address and read/write signals to the bus controller 
in the CTRL_FPGA unit through the FD bus. These signals are then converted and sent to the 
on-board SRAM. The memory write or read data are placed on the FD bus by a memory block 
interface or memory devices depending on the write or read operation. The read data are fetched 
by the memory block interface at the end of the evaluation sequence. In the external memory 
embodiment, the write data, address, and read/write signals from the memory blocks are sent to 
the computer system through DMA RD space 2. Then, the CPU performs memory access to the 
mapped memory location. The memory read data are sent to the driven logic located in the 
FPGA chips through DMA WR space 0. Essentially, space 2 is used to read the data, then 
evaluation occurs, and then the system uses space 0 to put the read data in the appropriate logic. 
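
The external memory access flow described above can be sketched as a single service step on the host. This Python sketch is a simplified illustration and not the actual host software: the `MemRequest` type and the function name are hypothetical, the DMA transfers themselves are represented only by the function's input and output, and returning 0 for locations that were never written is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class MemRequest:
    """One memory block request as received through DMA RD space 2."""
    block: str        # name of the memory block in the FPGA chips
    address: int
    write: bool       # True for a write, False for a read
    data: int = 0     # write data; unused for reads


def service_external_memory(
        requests: Iterable[MemRequest],
        external_memory: Dict[str, Dict[int, int]]) -> List[Tuple[str, int]]:
    """Perform the CPU-side accesses and collect the read data.

    The returned (block, data) pairs are what would be sent back to the
    driven logic in the FPGA chips through DMA WR space 0.
    """
    read_data: List[Tuple[str, int]] = []
    for req in requests:
        memory = external_memory[req.block]      # mapped memory location
        if req.write:
            memory[req.address] = req.data
        else:
            # Assumption: locations never written read back as 0.
            read_data.append((req.block, memory.get(req.address, 0)))
    return read_data
```

Each evaluation cycle would therefore consist of a space 2 read, one call of this kind on the host, and a space 0 write carrying the collected read data.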
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IX. COVERIFICATION SYSTEM 

The coverification system of the present invention can accelerate the design/development 
cycle by providing designers with the flexibility of software simulation and the faster speed 
derived from using a hardware model. Both the hardware and software portions of a design can 
be verified prior to ASIC fabrication and without the limitations of an emulator-based 
coverification tool. The debugging feature is enhanced and overall debug time can be 
significantly reduced. 

Conventional coverification tool with ASIC as the device-under-test 

FIG. 64 shows a typical final design embodied as a PCI add-on card, such as a video, 
multimedia, Ethernet, or SCSI card. This card 2000 includes a direct interface connector 2002 
that allows communication with other peripheral devices. The connector 2002 is coupled to bus 
2001 to transport video signals from a VCR, camera, or television tuner; video and audio outputs 
to a monitor or speaker; and signals to communication or disk drive interface. Depending on the 
user's design, one ordinarily skilled in the art can anticipate other interface requirements. The 
bulk of the functionality of the design is in chip 2004 which is coupled to the interface connector 
2002 via bus 2003, local oscillator 2005 via bus 2007 for generating a local clock signal, and 
memory 2006 via bus 2008. The add-on card 2000 also includes a PCI connector 2009 for 
coupling with a PCI bus 2010. 

Prior to implementing the design as an add-on card as shown in FIG. 64, the design is 
reduced to ASIC form for testing purposes. A conventional hardware/software coverification tool 
is shown in FIG. 65. The user's design is embodied in the form of an ASIC labeled as the 
device-under-test (or "DUT") 2024 in FIG. 65. To obtain stimulus from a variety of sources 
with which it is designed to interface, the device-under-test 2024 is placed in the target system 
2020, which is a combination of the central computing system 2021 on the motherboard and 
several peripherals. The target system 2020 includes a central computing system 2021 which 
includes a CPU and memory, and operates under some operating system such as Microsoft 
Windows or Sun Microsystems' Solaris to run a number of applications. As known to those 
ordinarily skilled in the art, Sun Microsystems' Solaris is an operating environment and set of 
software products which support Internet, Intranet and enterprise-wide computing. The Solaris 
operating environment is based on industry standard UNIX System V Release 4 and is designed to 
support client-server applications in a distributed networking environment, provide the appropriate 
resources for smaller workgroups, and provide the WebTone that is required for electronic 
commerce. 

The device driver 2022 for the device-under-test 2024 is included in the central computing 
system 2021 to enable communication between the operating system (and any applications) and 
the device-under-test 2024. As known to those ordinarily skilled in the art, a device driver is 
software that controls a hardware component or peripheral device of a computer system. 
A device driver is responsible for accessing the hardware registers of the device and often 
includes an interrupt handler to service interrupts generated by the device. Device drivers often 
form part of the lowest level of the operating system kernel, with which they are linked when the 
kernel is built. Some more recent systems have loadable device drivers which can be installed 
from files after the operating system is running. 

The device-under-test 2024 and the central computing system 2021 are coupled to a PCI 
bus 2023. Other peripherals in the target system 2020 include an Ethernet PCI add-on card 2025 
used to couple the target system to a network 2030 via bus 2034, a SCSI PCI add-on card 2026 
coupled to SCSI drives 2027 and 2031 via buses 2036 and 2035, a VCR 2028 coupled to the 
device-under-test 2024 via bus 2032 (if necessary for the design in the device-under-test 2024), 
and a monitor and/or speaker 2029 coupled to the device-under-test 2024 via bus 2033 (if 
necessary for the design in the device-under-test 2024). As known to those ordinarily skilled in 
the art, "SCSI" stands for "Small Computer Systems Interface," a processor-independent 
standard for system-level interfacing between a computer and intelligent devices such as hard 
disks, floppy disks, CD-ROM, printers, scanners and many more. 

In this target system environment, the device-under-test 2024 can be examined with a 
variety of stimuli from the central computing system (i.e., operating system, applications) and the 
peripheral devices. If time is not a concern and the designers are only seeking a simple pass/fail 
test, this coverification tool should be adequate to fulfill their needs. However, in most 
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situations, a design project is strictly budgeted and scheduled prior to release as a product. As 
explained above, this particular ASIC-based coverification tool is unsatisfactory because its debug 
feature is nonexistent (the designer cannot isolate the cause of a "failed" test without 
sophisticated techniques), and the number of "fixes" for every bug detected cannot be predicted at 
the outset of a project, which makes scheduling and budgeting unpredictable. 

Conventional coverification tool with an emulator as the device-under-test 

FIG. 66 illustrates a conventional coverification tool with an emulator. Unlike the set-up 
illustrated in FIG. 64 and described above, the device-under-test is programmed in an emulator 
2048 coupled to the target system 2040 and some peripheral devices and a test workstation 2052. 
The emulator 2048 includes an emulation clock 2066 and the device-under-test which was 
programmed in the emulator. 

The emulator 2048 is coupled to the target system 2040 via a PCI bus bridge 2044 and 
PCI bus 2057 and control lines 2056. The target system 2040 includes a combination of the 
central computing system 2041 on the motherboard and several peripherals. The target system 
2040 includes a central computing system 2041 which includes a CPU and memory, and operates 
under some operating system such as Microsoft Windows or Sun Microsystems' Solaris to run a 
number of applications. The device driver 2042 for the device-under-test is included in the 
central computing system 2041 to enable communication between the operating system (and any 
applications) and the device-under-test in the emulator 2048. To communicate with the emulator 
2048 as well as other devices which are part of this computing environment, the central 
computing system 2041 is coupled to the PCI bus 2043. Other peripherals in the target system 
2040 include an Ethernet PCI add-on card 2045 used to couple the target system to a network 
2049 via bus 2058, and a SCSI PCI add-on card 2046 coupled to SCSI drives 2047 and 2050 via 
buses 2060 and 2059. 

The emulator 2048 is also coupled to the test workstation 2052 via bus 2062. The test 
workstation 2052 includes a CPU and memory to perform its functions. The test workstation 
2052 may also include test cases 2061 and device models 2068 for other devices that are modeled 
but not physically coupled to the emulator 2048. 
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Finally, the emulator 2048 is coupled to some other peripheral devices such as a frame 
buffer or data stream record/play system 2051 via bus 2061. This frame buffer or data stream 
record/play system 2051 may also be coupled to a communication device or channel 2053 via bus 
2063, a VCR 2054 via bus 2064, and a monitor and/or speaker 2055 via bus 2065. 

As known to those ordinarily skilled in the art, the emulation clock operates at a speed 
much slower than the actual target system speed. Thus, that portion of FIG. 66 that is shaded is 
running at emulation speed while the other unshaded portions are running at actual target system 
speed. 

As described above, this coverification tool with the emulator has several limitations. 
When using a logic analyzer or a sample-and-hold device to get internal state information of the 
device-under-test, the designer must compile the design so that the relevant signals of 
interest for debug purposes are provided on the output pins for sampling. If the 
designer wants to debug a different part of the design, the designer must make sure that that part has output 
signals that can be sampled by the logic analyzer or the sample-and-hold device, or else must 
re-compile the design in the emulator 2048 so that these signals can be presented on the output 
pins for sampling purposes. These re-compile times may take days or weeks, which may be too 
long a delay for a time-sensitive design/development schedule. Furthermore, because this 
coverification tool uses signals, sophisticated circuitry must be provided to either convert these 
signals to data or to provide some signal-to-signal timing control. Moreover, the necessity of 
using numerous wires 2061 and 2062 for each signal desired for sampling increases the 
debug set-up burden and time. 

Simulation with Reconfigurable Computing Array 

As a brief review, FIG. 67 illustrates a high level configuration of the single-engine 
reconfigurable computing (RCC) array system of the present invention which was previously 
described above in this patent specification. This single-engine RCC system will be incorporated 
into the coverification system in accordance with one embodiment of the present invention. 

In FIG. 67, the RCC array system 2080 includes an RCC computing system 2081, a 
reconfigurable computing (RCC) hardware array 2084, and a PCI bus 2089 coupling them 
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together. Importantly, the RCC computing system 2081 includes the entire model of the user's 
design in software and the RCC hardware array 2084 includes a hardware model of the user's 
design. The RCC computing system 2081 includes the CPU, memory, an operating system, and 
the necessary software to run the single-engine RCC system 2080. A software clock 2082 is 
provided to enable the tight control of the software model in the RCC computing system 2081 
and the hardware model in the RCC hardware array 2084. Test bench data 2083 are also stored 
in the RCC computing system 2081. 

The RCC hardware array system 2084 includes a PCI interface 2085, a set of RCC 
hardware array boards 2086, and various buses for interface purposes. The set of RCC hardware 
array boards 2086 includes at least a portion of the user's design modeled in hardware (i.e., 
hardware model 2087) and memory 2088 for the test bench data. In one embodiment, various 
portions of this hardware model are distributed among a plurality of reconfigurable logic elements 
(e.g., FPGA chips) during configuration time. As more reconfigurable logic elements or chips 
are used, more boards may be needed. In one embodiment, four reconfigurable logic elements 
are provided on a single board. In other embodiments, eight reconfigurable logic elements are 
provided on a single board. The capacity and capabilities of the reconfigurable logic elements in 
the four-chip boards can differ significantly from that of the reconfigurable logic elements in the 
eight-chip board. 

Bus 2090 provides various clocks for the hardware model from the PCI interface 2085 to 
the hardware model 2087. Bus 2091 provides other I/O data between the PCI interface 2085 and 
the hardware model 2087 via connector 2093 and internal bus 2094. Bus 2092 functions as the 
PCI bus between the PCI interface 2085 and the hardware model 2087. Test bench data can also 
be stored in memory in the hardware model 2087. The hardware model 2087, as described 
above, includes structures and functions other than the hardware model of the user's design 
that are needed to enable the hardware model to interface with the RCC computing system 2081. 

This RCC system 2080 may be provided in a single workstation or, alternatively, coupled 
to a network of workstations where each workstation is provided access to the RCC system 2080 
on a time-shared basis. In effect, the RCC array system 2080 serves as a simulation server 
having a simulation scheduler and state swapping mechanism. The server allows each user at a 
workstation to access the RCC hardware array 2084 for high speed acceleration and hardware 
state swapping purposes. After the acceleration and state swapping, each user can locally 
simulate the user design in software while releasing control of the RCC hardware array 2084 to 
other users at other workstations. This network model will also be used for the coverification 
system described below. 
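
The time-shared access described above can be illustrated with a minimal sketch. It is not the scheduler of the RCC system; the class name `HardwareArrayServer`, the first-come-first-served queue, and the use of a cycle count as the "hardware state" are all assumptions made only for illustration.

```python
from collections import deque


class HardwareArrayServer:
    """Toy model of a time-shared hardware array with state swapping (hypothetical)."""

    def __init__(self):
        self.queue = deque()    # users waiting for the hardware array
        self.saved_state = {}   # per-user hardware state held in software
        self.log = []           # order in which users were served

    def request(self, user, cycles):
        """A workstation asks to accelerate its design for a number of cycles."""
        self.queue.append((user, cycles))

    def run(self):
        while self.queue:
            user, cycles = self.queue.popleft()
            # Swap in: load this user's saved state (0 cycles if first visit).
            state = self.saved_state.get(user, 0)
            # Accelerate: advance the hardware model by the requested cycles.
            state += cycles
            # Swap out: save the state so the user can continue simulating in
            # software while the array is released to the next user.
            self.saved_state[user] = state
            self.log.append(user)
```

In this sketch each user holds the array only for the duration of one request, which mirrors the idea that a user releases the hardware array after acceleration and state swapping.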

The RCC array system 2080 provides designers with the power and flexibility of 
simulating an entire design, accelerating part of the test points during selected cycles via the 
hardware model in the reconfigurable computing array, and obtaining internal state information of 
virtually any part of his design at any time. Indeed, the single-engine reconfigurable computing 
array (RCC) system, which can be loosely described as a hardware-accelerated simulator, can be 
used to perform the following tasks in a single debug session: (1) simulation alone, (2) simulation 
with hardware acceleration where the user can start, stop, assert values, and inspect internal 
states of the design at any time, (3) post-simulation analyses, and (4) in-circuit emulation. 
Because both the software model and the hardware model are under the strict control of a single 
engine via a software clock, the hardware model in the reconfigurable computing array is tightly 
coupled to the software simulation model. This allows the designer to debug cycle-by-cycle as 
well as accelerate and decelerate the hardware model through a number of cycles to obtain 
valuable internal state information. Moreover, because this simulation system handles data 
instead of signals, no complex signal-to-data conversion/timing circuitry is needed. Furthermore, 
the hardware model in the reconfigurable computing array does not need to be re-compiled if the 
designer wishes to examine a different set of nodes, unlike the typical emulation system. For 
further details, review the description above. 

Coverification System without External I/O 
One embodiment of the present invention is a coverification system which uses no actual 
and physical external I/O devices and target applications. Thus, a coverification system in 
accordance with one embodiment of the present invention can incorporate the RCC system along 
with other functionality to debug the software portion and hardware portion of a user's design 
without using any actual target system or I/O devices. The target system and external I/O 
devices are, instead, modeled in software in the RCC computing system. 

Referring to FIG. 68, the coverification system 2100 includes a RCC computing system 
2101, the RCC hardware array 2108, and a PCI bus 2114 coupling them together. Importantly, 
the RCC computing system 2101 includes the entire model of the user's design in software and 
the reconfigurable computing array 2108 includes a hardware model of the user's design. The 
RCC computing system 2101 includes the CPU, memory, an operating system, and the necessary 
software to run the single-engine coverification system 2100. A software clock 2104 is provided 
to enable the tight control of the software model in the RCC computing system 2101 and the 
hardware model in the reconfigurable computing array 2108. Test cases 2103 are also stored in 
the RCC computing system 2101. 

In accordance with one embodiment of the present invention, the RCC computing system 
2101 also includes the target applications 2102, a driver 2105 of the hardware model of the user's 
design, a model of a device (e.g., a video card) and its driver in software labeled as 2106, and a 
model of another device (e.g., a monitor) and its driver also in software labeled as 2107. 
Essentially, the RCC computing system 2101 contains as many device models and drivers as 
necessary to convey to the software model and the hardware model of the user's design that an 
actual target system and other I/O devices are part of this computing environment. 

The RCC hardware array 2108 includes a PCI interface 2109, a set of RCC hardware 
array boards 2110, and various buses for interface purposes. The set of RCC hardware array 
boards 2110 includes at least a portion of the user's design modeled in hardware 2112 and 
memory 2113 for the test bench data. As described above, each board contains a plurality of 
reconfigurable logic elements or chips. 

Bus 2115 provides various clocks for the hardware model from the PCI interface 2109 to 
the hardware model 2112. Bus 2116 provides other I/O data between the PCI interface 2109 and 
the hardware model 2112 via connector 2111 and internal bus 2118. Bus 2117 functions as the 
PCI bus between the PCI interface 2109 and the hardware model 2112. Test bench data can also 
be stored in memory in the hardware model 2113. The hardware model, as described above, 
includes structures and functions other than the hardware model of the user's design that are 
needed to enable the hardware model to interface with the RCC computing system 2101. 

To compare the coverification system of FIG. 68 to the conventional emulator-based 
coverification system, FIG. 66 shows the emulator 2048 coupled to the target system 2040, some 
I/O devices (e.g., frame buffer or data stream record/play system 2051), and a workstation 2052. 
This emulator configuration presents numerous problems and set-up issues for the designer. 
The emulator needs a logic analyzer or a sample-and-hold device to measure internal states of the 
user design modeled in the emulator. Because the logic analyzer and the sample-and-hold device 
need signals, complex signal-to-data conversion circuitry is required. Additionally, complex 
signal-to-signal timing control circuitry is also required. The numerous wires needed for every 
signal that will be used to measure the internal states of the emulator further burden the user 
during set-up. During the debug session, the user must re-compile the emulator each time he 
wants to examine a different set of internal logic circuitry so that the appropriate signals from that 
logic circuitry are provided as outputs for measurement and recording by the logic analyzer or the 
sample-and-hold device. The long re-compilation time is too costly. 

In the coverification system of the present invention in which no external I/O devices are 
coupled, the target system and other I/O devices are modeled in software so that an actual 
physical target system and I/O devices are not physically necessary. Because the RCC computing 
system 2101 processes data, no complex signal-to-data conversion circuitry or signal-to-signal 
timing control circuitry is needed. The number of wires is also not tied to the number of 
signals and hence, set-up is relatively simple. Furthermore, debugging a different portion of the 
logic circuitry in the hardware model of the user design does not require re-compilation because 
the coverification system processes data and not signals. Because the RCC computing system 
controls the RCC hardware array with the software-controlled clock (i.e., software clock and 
clock edge detection circuitry), starting and stopping the hardware model is facilitated. Reading 
data from the hardware model is also easy because the model of the entire user design is in 
software and the software clock enables synchronization. Thus, the user can debug by software 
simulation alone, accelerate part or all of the design in hardware, step through various desired 
test points cycle-by-cycle, and inspect internal states of the software and hardware model (i.e., 
register and combinational logic states). For example, the user can simulate the design with some 
test bench data, then download internal state information to the hardware model, accelerate the 
design with various test bench data with the hardware model, inspect the resulting internal state 
values of the hardware model by register/combinational logic regeneration and loading values 
from the hardware model to the software model, and the user can finally simulate other parts of 
the user design in software using the results of the hardware model-accelerated process. 

As described above, however, a workstation is still needed for debug session control 
purposes. In a network configuration, a workstation may be remotely coupled to the 
coverification system to access debug data remotely. In a non-network configuration, a 
workstation may be locally coupled to the coverification system or, in some other embodiments, 
the workstation may incorporate the coverification system internally so that debug data can be 
accessed locally. 

Coverification System with External I/O 

In FIG. 68, the various I/O devices and target applications were modeled in the RCC 
computing system 2101. However, when too many I/O devices and target applications are 
running in the RCC computing system 2101, the overall speed slows down. With only a single 
CPU in the RCC computing system 2101, more time is necessary to process the various data 
from all the device models and target applications. To increase the data throughput, actual I/O 
devices and target applications (instead of software models of these I/O devices and target 
applications) can be physically coupled to the coverification system. 

One embodiment of the present invention is a coverification system that uses actual and 
physical external I/O devices and target applications. Thus, a coverification system can 
incorporate the RCC system along with other functionality to debug the software portion and 
hardware portion of a user's design while using the actual target system and/or I/O devices. For 
testing, the coverification system can use both test bench data from software and stimuli from the 
external interface (e.g., target system and external I/O devices). Test bench data can be used to 
provide test data not only to pin-outs of the user design, but also to internal nodes in the 
user design. Actual I/O signals from external I/O devices (or target system) can only be directed 
to pin-outs of the user design. Thus, one main distinction between test data from an external 
interface (e.g., target system or external I/O device) and test bench processes in software is that 
test bench data can be used to test the user design with stimulus applied to pin-outs and internal 
nodes, whereas actual data from the target system or external I/O device can only be applied to 
the user design via its pin-outs (or nodes in the user design that represent pin-outs). In the 
following discussion, the structure of the coverification system and its configuration with respect 
to a target system and the external I/O devices will be presented. 
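
The distinction between the two stimulus sources can be expressed as a simple rule. The following is a minimal sketch only; the node names in `PIN_OUTS` and `INTERNAL_NODES` and the function `apply_stimulus` are hypothetical and do not appear in the figures.

```python
# Hypothetical node names, used only for illustration.
PIN_OUTS = {"pci_ad", "pci_clk"}
INTERNAL_NODES = {"fifo_full", "state_reg"}


def apply_stimulus(source, node, value, design_state):
    """Apply a stimulus value to a node of the user design.

    source is "test_bench" (may drive pin-outs and internal nodes) or
    "external" (target system or I/O device; may drive pin-outs only).
    """
    if node not in PIN_OUTS | INTERNAL_NODES:
        raise KeyError(node)
    if source == "external" and node not in PIN_OUTS:
        raise ValueError("external I/O data can only be applied to pin-outs")
    design_state[node] = value
    return design_state
```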

As a comparison to the system configuration of FIG. 66, the coverification system in 
accordance with one embodiment of the present invention replaces the structure and functionality 
of the items in the dotted line 2070. In other words, while FIG. 66 shows the emulator and the 
workstation within the confines of the dotted line 2070, one embodiment of the present invention 
includes the coverification system 2140 (and its associated workstation), shown in FIG. 69, 
within the dotted line 2070. 

Referring to FIG. 69, the coverification system configuration in accordance with one 
embodiment of the present invention includes a target system 2120, a coverification system 2140, 
some optional I/O devices, and a control/data bus 2131 and 2132 for coupling them together. 
The target system 2120 includes a central computing system 2121, which includes a CPU and 
memory, and operates under some operating system such as Microsoft Windows or Sun 
Microsystems' Solaris to run a number of applications 2122 and test cases 2123. The device 
driver 2124 for the hardware model of the user's design is included in the central computing 
system 2121 to enable communication between the operating system (and any applications) and 
the user's design. To communicate with the coverification system as well as other devices which 
are part of this computing environment, the central computing system 2121 is coupled to the PCI 
bus 2129. Other peripherals in the target system 2120 include an Ethernet PCI add-on card 2125 
used to couple the target system to a network, a SCSI PCI add-on card 2126 coupled to SCSI 
drive 2128 via bus 2130, and a PCI bus bridge 2127. 

The coverification system 2140 includes a RCC computing system 2141, a RCC hardware 
array 2190, an external interface 2139 in the form of an external I/O expander, and a PCI bus 
2171 coupling the RCC computing system 2141 and the RCC hardware array 2190 together. The 
RCC computing system 2141 includes the CPU, memory, an operating system, and the necessary 
software to run the single-engine coverification system 2140. Importantly, the RCC computing 
system 2141 includes the entire model of the user's design in software and the RCC hardware 
array 2190 includes a hardware model of the user's design. 

As discussed above, the single-engine of the coverification system derives its power and 
flexibility from a main software kernel which resides in the main memory of the RCC computing 
system 2141 and controls the overall operation and execution of the coverification system 2140. 
So long as any test bench processes are active or any signals from the external world are 
presented to the coverification system, the kernel evaluates active test bench components, 
evaluates clock components, detects clock edges to update registers and memories as well as 
propagate combinational logic data, and advances the simulation time. This main software 
kernel provides for the tightly coupled nature of the RCC computing system 2141 and the RCC 
hardware array 2190. 
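
The sequence of kernel steps listed above can be sketched as a loop. This is an illustrative toy only, assuming a design with a single input `d`, one rising-edge register `q`, and one combinational output `y = NOT q`; the function name `run_kernel` and the way the clock is derived from the simulation time are assumptions, not the kernel of the specification.

```python
def run_kernel(stimulus, half_period=1):
    """Run the toy kernel until the test bench stimulus is exhausted.

    stimulus: list of values driven onto input d, one per time step.
    Returns a list of (time, clk, q, y) tuples, one per time step.
    """
    time, clk, q = 0, 0, 0
    trace = []
    pending = list(stimulus)
    while pending:                        # a test bench process is still active
        d = pending.pop(0)                # 1. evaluate the test bench component
        prev_clk = clk
        clk = (time // half_period) % 2   # 2. evaluate the clock component
        if prev_clk == 0 and clk == 1:    # 3. detect a rising clock edge
            q = d                         #    and update the register
        y = 1 - q                         # 4. propagate combinational logic
        trace.append((time, clk, q, y))
        time += 1                         # 5. advance the simulation time
    return trace
```

Because the register is updated only inside the edge-detection branch, the toy register changes value only when the loop itself produces a clock edge, which is the sense in which a single engine keeps clocked state under software control.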

The software kernel generates a software clock signal from a software clock source 2142 
that is provided to the RCC hardware array 2190 and the external world. The clock source 2142 
can generate multiple clocks at different frequencies depending on the destination of these 
software clocks. Generally, the software clock ensures that the registers in the hardware model 
of the user's design evaluate in synchronization with the system clock and without any hold-time 
violations. The software model can detect clock edges in software that affect hardware model 
register values. Accordingly, a clock detection mechanism ensures that a clock edge detection in 
the main software model can be translated to clock detection in the hardware model. For a more 
detailed discussion of software clocks and the clock-edge detection logic, refer to FIGS. 17-19 
and accompanying text in the patent specification. 

In accordance with one embodiment of the present invention, the RCC computing system 
2141 may also include one or more models of a number of I/O devices, despite the fact that other 
actual physical I/O devices can be coupled to the coverification system. For example, the RCC 
computing system 2141 may include a model of a device (e.g., a speaker) along with its driver 
and test bench data in software labeled as 2143, and a model of another device (e.g., a graphics 
accelerator) along with its driver and test bench data in software labeled as 2144. The user 
decides which devices (and their respective drivers and test bench data) will be modeled and 
incorporated into the RCC computing system 2141 and which devices will be actually coupled to 
the coverification system. 

The coverification system contains control logic that provides traffic control between: 
(1) the RCC computing system 2141 and the RCC hardware array 2190, and (2) the external 
interface (which is coupled to the target system and the external I/O devices) and the RCC 
hardware array 2190. Some data passes between the RCC hardware array 2190 and the RCC 
computing system 2141 because some I/O devices may be modeled in the RCC computing 
system. Furthermore, the RCC computing system 2141 has the model of the entire design in 
software, including that portion of the user design modeled in the RCC hardware array 2190. As 
a result, the RCC computing system 2141 must also have access to all data that passes between 
the external interface and the RCC hardware array 2190. The control logic ensures that the RCC 
computing system 2141 has access to these data. The control logic will be described in greater 
detail below. 

The RCC hardware array 2190 includes a number of array boards. In this particular 
embodiment shown in FIG. 69, the hardware array 2190 includes boards 2145-2149. Boards 
2146-2149 contain the bulk of the configured hardware model. Board 2145 (or board m1) 
contains a reconfigurable computing element (e.g., FPGA chip) 2153, which the coverification 
system can use to configure at least a portion of the hardware model, and an external I/O 
controller 2152 which directs traffic and data between the external interface (target system and 
I/O devices) and the coverification system 2140. Board 2145, via the external I/O controller, 
allows the RCC computing system 2141 to have access to all data transported between the 
external world (i.e., target system and I/O devices) and the RCC hardware array 2190. This 
access is important because the RCC computing system 2141 in the coverification system contains 
a model of the entire user design in software and the RCC computing system 2141 can also 
control the functionality of the RCC hardware array 2190. 

If stimulus from an external I/O device is provided to the hardware model, the software 
model must also have access to this stimulus so that the user of this coverification system 
can selectively control the next debug step, which may include inspecting internal state values of 
his design as a result of this applied stimulus. As discussed above with respect to the board layout 
and interconnection scheme, the first and last board are included in the hardware array 2190. 
Thus, board 1 (labeled as board 2146) and board 8 (labeled as board 2149) are included in an 
eight-board hardware array (excluding board m1). Other than these boards 2145-2149, board m2 
(not shown in FIG. 69, but see FIG. 74) may also be provided having chip m2. This board m2 is 
similar to board m1 except that board m2 does not have any external interface and can be used 
for expansion purposes if additional boards are necessary. 

The contents of these boards will now be discussed. Board 2145 (board m1) includes a PCI 
controller 2151, an external I/O controller 2152, data chip (m1) 2153, memory 2154, and 
multiplexer 2155. In one embodiment, this PCI controller is a PLX 9080. The PCI controller 2151 
is coupled to the RCC computing system 2141 via bus 2171 and to a tri-state buffer 2179 via bus 
2172. 

The main traffic controller in the coverification system between the external world (target 
system 2120 and I/O devices) and the RCC computing system 2141 is an external I/O controller 
2152 (also known as "CTRLXM" in FIGS. 69, 71, and 73), which is coupled to the RCC 
computing system 2141, the other boards 2146-2149 in the RCC hardware array, the target 
system 2120, and the actual external I/O devices. Of course, the main traffic controller between 
the RCC computing system 2141 and the RCC hardware array 2190 has always been the 
combination of the individual internal I/O controllers (e.g., I/O controllers 2156 and 2158) in 
each array board 2146-2149 and the PCI controller 2151, as described above. In one 
embodiment, these individual internal I/O controllers, such as controllers 2156 and 2158, are the 
FPGA I/O controllers described and illustrated above in such exemplary figures as FIG. 22 (unit 
700) and FIG. 56 (unit 1200). 

The external I/O controller 2152 is coupled to the tri-state buffer 2179 to allow the 
external I/O controller to interface with the RCC computing system 2141. In one embodiment, 
the tri-state buffer 2179 allows data from the RCC computing system 2141 to pass to the local 
bus 2180 while preventing data from the local bus from passing to the RCC computing system 2141 
in some instances, and allows data to pass from the local bus 2180 to the RCC computing system 
2141 in other instances. 

The external I/O controller 2152 is also coupled to chip (m1) 2153 and memory/external 
buffer 2154 via data bus 2176. In one embodiment, chip (m1) 2153 is a reconfigurable 
computing element, such as an FPGA chip, that can be used to configure at least a portion of the 
hardware model of the user design (or all of the hardware model, if the user design is small 
enough). External buffer 2154 is a DRAM DIMM in one embodiment and can be used by chip 
2153 for a variety of purposes. The external buffer 2154 provides a large memory capacity, 
more than the individual SRAM memory devices coupled locally to each reconfigurable logic 
element (e.g., reconfigurable logic element 2157). This large memory capacity allows the RCC 
computing system to store large chunks of data such as test bench data, embedded code for 
microcontrollers (if the user design is a microcontroller), and a large look-up table in one 
memory device. The external buffer 2154 can also be used to store data necessary for the 
hardware modeling, as described above. In essence, this external buffer 2154 can partly function 
like the other high or low bank SRAM memory devices described and illustrated above in, for 
example, FIG. 56 (SRAM 1205 and 1206) but with more memory. External buffer 2154 can also 
be used by the coverification system to store data received from the target system 2120 and the 
external I/O devices so that these data can later be retrieved by the RCC computing system 2141. 
Chip m1 2153 and external buffer 2154 also contain the memory mapping logic described in the 
patent specification herein under the section called "Memory Simulation." 

To access the desired data in the external buffer 2154, both the chip 2153 and the RCC 
computing system 2141 (via the external I/O controller 2152) can deliver the address for the 
desired data. The chip 2153 provides the address on address bus 2182 and the external I/O 
controller 2152 provides the address on address bus 2177. These address buses 2182 and 2177 
are inputs to a multiplexer 2155, which provides the selected address on output line 2178 coupled 
to the external buffer 2154. The select signal for the multiplexer 2155 is provided by the external 
I/O controller 2152 via line 2181. 
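
The address selection just described behaves like a two-input multiplexer in front of a memory. A minimal sketch follows; the polarity of the select signal (1 selects the controller's address) and the names `select_address` and `ExternalBuffer` are assumptions for illustration, since the text does not specify them.

```python
def select_address(select, chip_addr, controller_addr):
    """Two-input multiplexer in front of the external buffer.

    chip_addr models the address on bus 2182, controller_addr the address on
    bus 2177. The select polarity is an assumption: 1 picks the controller.
    """
    return controller_addr if select else chip_addr


class ExternalBuffer:
    """Toy word-addressed memory standing in for the external buffer."""

    def __init__(self, size):
        self.cells = [0] * size

    def write(self, addr, value):
        self.cells[addr] = value

    def read(self, addr):
        return self.cells[addr]
```

With this arrangement only one address source reaches the buffer at a time, and the source is chosen by the external I/O controller.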

The external I/O controller 2152 is also coupled to the other boards 2146-2149 via bus 
2180. In one embodiment, bus 2180 is the local bus described and illustrated above in such 
exemplary figures as FIG. 22 (local bus 708) and FIG. 56 (local bus 1210). In this embodiment, 
only five boards (including board 2145 (board m1)) are used. The actual number of boards is 
determined by the complexity and magnitude of the user's design that will be modeled in 
hardware. A hardware model of a user design that is of medium complexity requires fewer boards 
than a hardware model of a user design that is of higher complexity. 

To enable scalability, the boards 2146-2149 are substantially identical to each other except 
for some inter-board interconnect lines. These interconnect lines enable one portion of the 
hardware model of the user's design in one chip (e.g., chip 2157 in board 2146) to communicate 
with another part of the hardware model in the same user's design that is physically located in 
another chip (e.g., chip 2161 in board 2148). Briefly refer to FIG. 74 for the interconnect 
structure for this coverification system, as well as FIGS. 8 and 36-44 and their accompanying 
descriptions in this patent specification. 

Board 2148 is a representative board. Board 2148 is the third board in this four-board 
layout (excluding board 2145 (board m1)). Accordingly, it is not an end-board that needs 
appropriate terminations for the interconnect lines. Board 2148 includes an internal I/O 
controller 2158, several reconfigurable logic elements (e.g., FPGA chips) 2159-2166, high bank 
FD bus 2167, low bank FD bus 2168, high bank memory 2169, and low bank memory 2170. As 
stated above, the internal I/O controller 2158 is, in one embodiment, the FPGA I/O controller 
described and illustrated above in such exemplary figures as FIG. 22 (unit 700) and FIG. 56 (unit 
1200). Similarly, the high and low bank memory devices 2169 and 2170 are the SRAM memory 
devices described and illustrated above in, for example, FIG. 56 (SRAM 1205 and 1206). The 
high and low bank FD buses 2167 and 2168 are, in one embodiment, the FD bus or FPGA bus 
described and illustrated above in such exemplary figures as FIG. 22 (FPGA bus 718 and 719), 
FIG. 56 (FD bus 1212 and 1213), and FIG. 57 (FD bus 1282). 

To couple the coverification system 2140 to the target system 2120 and other I/O devices, 
an external interface 2139 in the form of an external I/O expander is provided. On the target 
system side, the external I/O expander 2139 is coupled to the PCI bridge 2127 via secondary PCI 
bus 2132 and a control line 2131, which is used to deliver the software clock. On the I/O device 
side, the external I/O expander 2139 is coupled to various I/O devices via buses 2136-2138 for 
pin-out data and control lines 2133-2135 for the software clock. The number of I/O devices that 
can be coupled to the I/O expander 2139 is determined by the user. In any event, as many data 
buses and software clock control lines are provided in the external I/O expander 2139 as are 
necessary to couple as many I/O devices to the coverification system 2140 as are needed to run a 
successful debug session. 

On the coverification system 2140 side, the external I/O expander 2139 is coupled to the 
external I/O controller 2152 via data bus 2175, software clock control line 2174, and scan control 
line 2173. Data bus 2175 is used to pass pin-out data between the external world (target system 
2120 and external I/O devices) and the coverification system 2140. Software clock control line 
2174 is used to deliver the software clock data from the RCC computing system 2141 to the 
external world. 

The software clock present on control lines 2174 and 2131 is generated by the main 
software kernel in the RCC computing system 2141. The RCC computing system 2141 delivers a 
software clock to external I/O expander 2139 via the PCI bus 2171, PCI controller 2151, bus 
2172, tri-state buffer 2179, local bus 2180, external I/O controller 2152, and control line 2174. 
From the external I/O expander 2139, the software clock is provided as the clock input to the 
target system 2120 (via the PCI bridge 2127), and other external I/O devices via control lines 
2133-2135. Because the software clock functions as the main clock source, the target system 
2120 and the I/O devices run at a slower speed. However, the data provided to the target system 
2120 and the external I/O devices are synchronized to the software clock speed like the software 
model in the RCC computing system 2141 and the hardware model in the RCC hardware array 
2190. Similarly, data from the target system 2120 and the external I/O devices are delivered to 
the coverification system 2140 in synchronization with the software clock. 

Thus, I/O data passed between the external interface and the coverification system are 
synchronized with the software clock. Essentially, the software clock synchronizes the operation 
of the external I/O devices and the target system with that of the coverification system (in the 
RCC computing system and the RCC hardware array) whenever data passes between them. The 
software clock is used for both data-in operations and data-out operations. For data-in 
operations, as a pointer (to be discussed later) latches the software clock from the RCC 
computing system 2141 to the external interface, other pointers will latch these I/O data in from 
the external interface to selected internal nodes in the hardware model of the RCC hardware array 
2190. One by one, the pointers will latch these I/O data in during this cycle when the software 
clock was delivered to the external interface. When all data have been latched in, the RCC 
computing system can generate another software clock again to latch in more data at another 
software clock cycle, if desired. For data-out operations, the RCC computing system can deliver 
the software clock to the external interface and subsequently control the gating of data from the 
internal nodes of the hardware model in the RCC hardware array 2190 to the external interface 
with the aid of pointers. Again, one by one, the pointers will gate data from the internal nodes to 
the external interface. If more data needs to be delivered to the external interface, the RCC 
computing system can generate another software clock cycle and then activate selected pointers to 
gate data out to the external interface. The generation of the software clock is strictly controlled 
and thus allows the coverification system to synchronize data delivery and data evaluation 
between the coverification system and any external I/O devices that are coupled to the external 
interface. 
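
The pointer-driven data-in and data-out operations can be modeled in a few lines. This is a sketch under stated assumptions: pointers are represented as an ordered list of hypothetical (pin, node) pairs, dictionaries stand in for the external interface and the internal nodes, and the function names are illustrative only.

```python
def data_in(external_pins, pointers, nodes):
    """Latch pin values into internal nodes, one pointer at a time.

    pointers: ordered list of (pin, node) pairs, one pair per pointer.
    Returns the order in which nodes were written during this clock cycle.
    """
    order = []
    for pin, node in pointers:            # pointers are activated one by one
        nodes[node] = external_pins[pin]  # latch the I/O data into the node
        order.append(node)
    return order


def data_out(nodes, pointers, external_pins):
    """Gate internal node values out to the pins, one pointer at a time."""
    for pin, node in pointers:
        external_pins[pin] = nodes[node]


def run_cycles(batches, pointers, nodes):
    """Latch each batch of pin data during its own software clock cycle."""
    cycles = 0
    for pins in batches:
        cycles += 1                       # one software clock is generated
        data_in(pins, pointers, nodes)    # all pointers finish before the next
    return cycles
```

In this sketch a new software clock cycle begins only after every pointer has finished, which reflects the statement that another software clock is generated once all data have been latched in.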

Scan control line 2173 is used to allow the coverification system 2140 to scan the data 
buses 2132, 2136, 2137, and 2138 for any data that may be present. The logic in the external 
I/O controller 2152 supporting the scan signal is a pointer logic where various inputs are 
provided as outputs for a specific time period before moving on to the next input via a MOVE 
signal. This logic is analogous to the scheme shown in FIG. 11. In effect, the scan signal 
functions like a select signal for a multiplexer except that it selects the various inputs to the 
multiplexer in round robin order. Thus, in one time period, the scan signal on scan control line 
2173 samples data bus 2132 for data that may be coming from the target system 2120. At the 
next time period, the scan signal on scan control line 2173 samples data bus 2136 for data that 
may be coming from an external I/O device that may be coupled there. At the next time period, data 
bus 2137 is sampled, and so on, so that the coverification system 2140 can receive and process all 
pin-out data that originated from the target system 2120 or the external I/O devices during this 
debug session. Any data that is received by the coverification system 2140 from sampling the 
data buses 2132, 2136, 2137, and 2138 are transported to the external buffer 2154 via the 
external I/O controller 2152. 
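
The round robin scan can be sketched as follows. The function name `scan_buses`, the use of Python lists as per-bus queues of pending data words, and the list standing in for the external buffer are assumptions for illustration; the actual pointer logic is the hardware scheme referenced above.

```python
from itertools import cycle


def scan_buses(buses, periods):
    """Sample the data buses in round robin order.

    buses: dict mapping a bus name to a list of pending data words.
    periods: number of time periods to scan.
    Returns the captured (bus, word) pairs in the order they were stored,
    standing in for the contents written to the external buffer.
    """
    external_buffer = []
    pointer = cycle(buses)          # each next() models one MOVE to the next bus
    for _ in range(periods):
        bus = next(pointer)
        if buses[bus]:              # data is present on this bus in this period
            external_buffer.append((bus, buses[bus].pop(0)))
    return external_buffer
```

A bus with no pending data simply consumes its time period, so every bus is visited at a fixed interval regardless of how busy the others are.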

Note that the configuration illustrated in FIG. 69 assumes that the target system 2120 
contains the primary CPU and the user design is some peripheral device, such as a video 
controller, network adapter, graphics adapter, mouse, or some other support device, card, or
logic. Thus, the target system 2120 contains the target applications (including the operating
system) coupled to the primary PCI bus 2129, and the coverification system 2140 contains the
user design and is coupled to the secondary PCI bus 2132. The configuration may be quite
different depending on the subject of the user design. For example, if the user design was a
CPU, the target application would run in the RCC computing system 2141 of the coverification
system 2140 while the target system 2120 would no longer contain the central computing system
2121. Indeed, the bus 2132 would now be a primary PCI bus and bus 2129 would be a
secondary PCI bus. In effect, instead of the user design being one of the peripheral devices
supporting the central computing system 2121, the user design is now the main computing center
and all other peripheral devices are supporting the user design.

The control logic for transporting data between the external interface (external I/O
expander 2139) and the coverification system 2140 is found in each board 2145-2149. The
primary portion of the control logic is found in the external I/O controller 2152 but other portions
are found in the various internal I/O controllers (e.g., 2156 and 2158) and the reconfigurable
logic elements (e.g., FPGA chips 2159 and 2165). For instructional purposes, it is necessary
only to show some portion of this control logic instead of the same repetitive logic structure for
all chips in all boards. The portion of the coverification system 2140 within the dotted line 2150
of FIG. 69 contains one subset of the control logic. This control logic will now be discussed in
greater detail with respect to FIGS. 70-73.

The components in this particular subset of the control logic include the external I/O
controller 2152, the tri-state buffer 2179, internal I/O controller 2156 (CTRL 1), the
reconfigurable logic element 2157 (chip0_1, which indicates chip 0 of board 1), and parts of
various buses and control lines which are coupled to these components. Specifically, FIG. 70
illustrates that portion of the control logic that is used for data-in cycles, where the data from the
external interface (external I/O expander 2139) and the RCC computing system 2141 are
delivered to the RCC hardware array 2190. FIG. 72 illustrates the timing diagram of the data-in
cycles. FIG. 71 illustrates that portion of the control logic that is used for data-out cycles, where
data from the RCC hardware array 2190 are delivered to the RCC computing system 2141 and
the external interface (external I/O expander 2139). FIG. 73 illustrates the timing diagram of the
data-out cycles.



Data-in 

The data-in control logic in accordance with one embodiment of the present invention is
responsible for handling the data delivered from either the RCC computing system or the external
interface to the RCC hardware array. One particular subset 2150 (see FIG. 69) of the data-in
control logic is shown in FIG. 70 and includes the external I/O controller 2200, tri-state buffer
2202, internal I/O controller 2203, reconfigurable logic element 2204, and various buses and
control lines to allow data transport therebetween. The external buffer 2201 is also shown for
this data-in embodiment. This subset illustrates the logic necessary for data-in operations, where
the data from the external interface and the RCC computing system are delivered to the RCC
hardware array. The data-in control logic of FIG. 70 and the data-in timing diagram of FIG. 72
will be discussed together.

Two types of data cycles are used in this data-in embodiment of the present invention: a
global cycle and a software-to-hardware (S2H) cycle. The global cycle is used for any data that is
directed to all the chips in the RCC hardware array such as clocks, resets, and some other S2H
data directed at many different nodes in the RCC hardware array. For these latter "global" S2H
data, it is more feasible to send them via the global cycle than via the sequential S2H cycle.

The software-to-hardware cycle is used to send data from the test bench processes in the
RCC computing system to the RCC hardware array sequentially from one chip to another in all
the boards. Because the hardware model of the user design is distributed across several boards,
the test bench data must be provided to every chip for data evaluation. Thus, the data is
delivered sequentially to each internal node in each chip, one internal node at a time. The
sequential delivery allows particular data designated for a particular internal node to be
processed by all the chips in the RCC hardware array since the hardware model is distributed
among a plurality of chips.

For this data evaluation, the coverification system provides two address spaces, S2H and CLK.
As described above, the S2H and CLK spaces are the primary input from the kernel to the
hardware model. The hardware model holds substantially all the register components and the
combinational components of the user's circuit design. Furthermore, the software clock is
modeled in software and provided in the CLK I/O address space to interface with the hardware
model. The kernel advances simulation time, looks for active test-bench components, and
evaluates clock components. When any clock edge is detected by the kernel, registers and
memories are updated and values through combinational components are propagated. Thus, any
changes in values in these spaces will trigger the hardware model to change logic states if the
hardware acceleration mode is selected.

During data transfer, the DATA_XSFR signal is at logic "1." During this time, the local
bus 2222-2230 will be used by the coverification system to transport data with the following data
cycles: (1) global data from the RCC computing system to the RCC hardware array and the CLK
space; (2) global data from the external interface to the RCC hardware array and the external
buffer; and (3) S2H data from the RCC computing system to the RCC hardware array, one chip
at a time in each board. Thus, the first two data cycles are part of the global cycle and the last
data cycle is part of the S2H cycle.

For the first part of the data-in global cycle where the global data from the RCC
computing system is sent to the RCC hardware array, the external I/O controller 2200 enables a
CPU_IN signal to logic "1" on line 2255. Line 2255 is coupled to an enable input of the tri-state
buffer 2202. With logic "1" on line 2255, the tri-state buffer 2202 allows data on the local bus
2222 to pass to the local buses 2223-2230 on the other side of the tri-state buffer 2202. In this
particular example, local buses 2223, 2224, 2225, 2226, 2227, 2228, 2229, and 2230 correspond
to LD3, LD4 (from the external I/O controller 2200), LD6 (from the external I/O controller
2200), LD1, LD6, LD4, LD5, and LD7, respectively.

The global data travels from these local bus lines to bus lines 2231-2235 in the internal
I/O controller 2203 and then to the FD bus lines 2236-2240. In this example, the FD bus lines
2236, 2237, 2238, 2239, and 2240 correspond to FD bus lines FD1, FD6, FD4, FD5, and FD7,
respectively.

These FD bus lines 2236-2240 are coupled to the inputs to latches 2208-2213 in the
reconfigurable logic element 2204. In this example, the reconfigurable logic element corresponds
to chip0_1 (i.e., chip 0 in board 1). Also, FD bus line 2236 is coupled to latch 2208, FD bus
line 2237 is coupled to latches 2209 and 2211, FD bus line 2238 is coupled to latch 2210, FD bus
line 2239 is coupled to latch 2212, and FD bus line 2240 is coupled to latch 2213.

The enable inputs for each of these latches 2208-2213 are coupled to several global
pointers and software-to-hardware (S2H) pointers. The enable inputs to latches 2208-2211 are
coupled to the global pointers and the enable inputs to latches 2212-2213 are coupled to S2H
pointers. Some exemplary global pointers include GLB_PTR0 on line 2241, GLB_PTR1 on line
2242, GLB_PTR2 on line 2243, and GLB_PTR3 on line 2244. Some exemplary S2H pointers
include S2H_PTR0 on line 2245 and S2H_PTR1 on line 2246. Because the enable inputs to these
latches are coupled to these pointers, the respective latches cannot latch data to their intended
destination nodes in the hardware model of the user design without the proper pointer signals.

These global and S2H pointer signals are generated by a data-in pointer state machine
2214 on output 2254. The data-in pointer state machine 2214 is controlled by the DATA_XSFR
and F_WR signals on line 2253. The internal I/O controller 2203 generates the DATA_XSFR
and F_WR signals on line 2253. The DATA_XSFR signal is always at logic "1" whenever data
transfer between the RCC hardware array and either the RCC computing system or the external
interface is desired. The F_WR signal, in contrast to the F_RD signal, is at logic "1" whenever
a write to the RCC hardware array is desired. A read via the F_RD signal requires the delivery
of data from the RCC hardware array to either the RCC computing system or the external
interface. If both the DATA_XSFR and F_WR signals are at logic "1," the data-in pointer state
machine can generate the proper global or S2H pointer signals at the proper programmed
sequence.
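
As a rough behavioral illustration of this gating, the following Python sketch models a pointer state machine that asserts the next pointer in a programmed sequence only when both DATA_XSFR and F_WR are at logic "1." The class name, the step method, and the choice to rewind the sequence when DATA_XSFR falls are assumptions made for the sketch; the pointer names follow the example of FIG. 70.

```python
# Programmed pointer sequence, following the example pointers named in the text.
SEQUENCE = ["GLB_PTR0", "GLB_PTR1", "GLB_PTR2", "GLB_PTR3",
            "S2H_PTR0", "S2H_PTR1"]


class DataInPointerStateMachine:
    """Hypothetical software model of a data-in pointer state machine."""

    def __init__(self, sequence=SEQUENCE):
        self.sequence = list(sequence)
        self.index = 0

    def step(self, data_xsfr, f_wr):
        """Return the pointer asserted for this F_WR pulse, or None."""
        if not data_xsfr:
            # Assumption: ending the data transfer rewinds the sequence.
            self.index = 0
            return None
        if not f_wr or self.index >= len(self.sequence):
            return None
        pointer = self.sequence[self.index]
        self.index += 1
        return pointer


fsm = DataInPointerStateMachine()
# Each tuple is (DATA_XSFR, F_WR) for one step.
pulses = [(1, 1), (1, 0), (1, 1), (0, 0), (1, 1)]
asserted = [fsm.step(x, w) for x, w in pulses]
print(asserted)
```

No pointer is produced while F_WR is low, which mirrors the statement that a latch cannot drive its node without the proper pointer signal.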

The outputs 2247-2252 of these latches are coupled to various internal nodes in the 
hardware model of the user design. Some of these internal nodes correspond to input pin-outs of 
the user design. The user design has other internal nodes that are normally not accessible via 
pin-outs but these non-pin-out internal nodes are for other debugging purposes to provide
flexibility for the designer who desires to apply stimuli to various internal nodes in the user
design, regardless of whether they are input pin-outs or not. For stimuli applied by the external
interface to the elaborate hardware model of the user design, the data-in logic and those internal
nodes corresponding to input pin-outs are implicated. For example, if the user design is a CRTC
6845 video controller, some input pin-outs may be as follows:

LPSTB - a light pen strobe pin
-RESET - low level signal to reset the 6845 controller
RS - register select
E - enable
CLK - clock
-CS - chip select

Other input pin-outs are also available in this video controller. Based on the number of
input pin-outs that interface to the outside world, the number of nodes and hence, the number of
latches and pointers can be readily determined. Some hardware model configured in the RCC
hardware array may have, for example, thirty separate latches associated with each of
GLB_PTR0, GLB_PTR1, GLB_PTR2, GLB_PTR3, S2H_PTR0, and S2H_PTR1 for a total of
180 latches (=30x6). In other designs, more global pointers such as GLB_PTR4 to GLB_PTR30
may be used as necessary. Similarly, more S2H pointers such as S2H_PTR2 to S2H_PTR30 may
be used as necessary. These pointers and their corresponding latches are based on the
requirements of the hardware model of each user design.

Returning to FIGS. 70 and 72, the data on the FD bus lines make their way to these
internal nodes only if the latches are enabled with the proper global pointer or S2H pointer
signal. Otherwise, these internal nodes are not driven by any data on the FD bus. When F_WR
is at logic "1" during the first half of the CPU_IN=1 time period, GLB_PTR0 is at logic "1" to
drive the data on FD1 to the corresponding internal node via line 2247. If other latches exist that
depend on GLB_PTR0 for enabling, these latches will also latch data to their corresponding
internal nodes. In the second half of the CPU_IN=1 time period, F_WR goes to logic "1" again,
which triggers GLB_PTR1 to rise to logic "1." This drives the data on FD6 to the internal node
coupled to line 2248. This also causes the software clock signal on line 2223 to be latched to line
2216 by latch 2205 under the GLB_PTR1 signal on enable line 2215. This software clock is
delivered to the external clock inputs to the target system and other external I/O devices. Since
GLB_PTR0 and GLB_PTR1 are used only for the first part of the data-in global cycle, CPU_IN
returns to logic "0" and this completes the delivery of global data from the RCC computing
system to the RCC hardware array.
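
The pointer-enabled latching just described can be sketched in Python as follows. This is only an illustrative model, not the disclosed circuit: the LATCHES table layout and the function name apply_pointer are assumptions, and the bit values on the FD bus are made up. The latch, FD line, pointer, and output line numerals are read from the FIG. 70 example.

```python
# latch numeral -> (FD line, enabling pointer, output line to the internal node)
LATCHES = {
    2208: ("FD1", "GLB_PTR0", 2247),
    2209: ("FD6", "GLB_PTR1", 2248),
    2210: ("FD4", "GLB_PTR2", 2249),
    2211: ("FD6", "GLB_PTR3", 2250),
    2212: ("FD5", "S2H_PTR0", 2251),
    2213: ("FD7", "S2H_PTR1", 2252),
}


def apply_pointer(nodes, fd_bus, pointer):
    """Latch FD data into every internal node whose latch this pointer enables."""
    for fd_line, enable, out_line in LATCHES.values():
        if enable == pointer:
            nodes[out_line] = fd_bus[fd_line]
    return nodes


# Internal nodes start undriven (None); the FD bus values are hypothetical.
nodes = {out_line: None for _, _, out_line in LATCHES.values()}
fd_bus = {"FD1": 1, "FD4": 0, "FD5": 1, "FD6": 1, "FD7": 0}

# First part of the data-in global cycle: GLB_PTR0, then GLB_PTR1.
apply_pointer(nodes, fd_bus, "GLB_PTR0")
apply_pointer(nodes, fd_bus, "GLB_PTR1")
print(nodes)
```

Although latches 2209 and 2211 share FD6, only the one whose pointer is asserted captures the value, which is why the same FD line can serve two nodes in different time slots.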

The second part of the data-in global cycle will now be discussed, where global data from
the external interface are delivered to the RCC hardware array and the external buffer. Again,
the various input pin-out signals from either the target system or the external I/O devices that are
directed at the user design must be provided to the hardware model and the software model.
These data can be delivered to the hardware model by using the appropriate pointers and latches
to drive the internal nodes. These data are also delivered to the software model by first storing
them in the external buffer 2201 for later retrieval by the RCC computing system to update the
internal states of the software model.

CPU_IN is now at logic "0" and EXT_IN is at logic "1." Accordingly, the tri-state
buffer 2206 in the external I/O controller 2200 is enabled to let the data on such PCI bus lines as
bus lines 2217 and 2218 pass through. These PCI bus lines are also coupled to FD bus lines 2219
for storage in the external buffer 2201. In the first half of the time period when the EXT_IN
signal is at logic "1," GLB_PTR2 is at logic "1." This causes the data on FD4 (via bus lines
2217, 2224, and local bus line 2228 (LD4)) to be latched to the internal node in the hardware
model coupled to line 2249.

During the second half of the time period when the EXT_IN signal is at logic "1,"
GLB_PTR3 is at logic "1." This causes the data on FD6 (via bus lines 2218, 2225, and local
bus line 2227 (LD6)) to be latched to the internal node in the hardware model coupled to line
2250.

As stated above, these data from the target system or some other external I/O devices are
also delivered to the software model by first storing them in the external buffer 2201 for later
retrieval by the RCC computing system to update the internal states of the software model. These
data on bus lines 2217 and 2218 are provided on FD bus FD[63:0] 2219 to external buffer 2201.
The particular memory address at which each datum is stored in the external buffer 2201 is
provided by memory address counter 2207 via bus 2220 to the external buffer 2201. To enable
such storage, the WR_EXT_BUF signal is provided to the external buffer 2201 via line 2221.
Before the external buffer 2201 is full, the RCC computing system will read the contents of the
external buffer 2201 so that appropriate updates can be made to the software model. Any data
that was delivered to the various internal nodes of the hardware model in the RCC hardware
array will probably result in some internal state changes in the hardware model. Because the RCC
computing system has the model of the entire user design in software, these internal state changes
in the hardware model should also be reflected in the software model. This concludes the data-in
global cycle.
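
The buffering scheme above (an address counter selecting the storage location, a write enable, and a read by the RCC computing system before the buffer fills) can be sketched in Python. This is a simplified software analogy only; the class name ExternalBuffer, the drain method, the depth of four words, and the choice to reset the address counter on a read are all assumptions not stated in the text.

```python
class ExternalBuffer:
    """Hypothetical model of an external buffer fed by an address counter."""

    def __init__(self, depth):
        self.depth = depth
        self.cells = [None] * depth
        self.address = 0  # plays the role of the memory address counter

    def write(self, fd_word, wr_ext_buf=1):
        """Store one FD bus word if the write enable is asserted."""
        if not wr_ext_buf:
            return False
        if self.address >= self.depth:
            raise OverflowError("buffer must be read before it fills")
        self.cells[self.address] = fd_word
        self.address += 1
        return True

    def drain(self):
        """Read back all stored words (assumed to reset the counter)."""
        words = self.cells[:self.address]
        self.cells = [None] * self.depth
        self.address = 0
        return words


buf = ExternalBuffer(depth=4)
buf.write(0x10)
buf.write(0x20)
buf.write(0x30, wr_ext_buf=0)  # write enable low: the word is not stored
updates = buf.drain()
print(updates)
```

The words returned by drain stand in for the data the RCC computing system would use to update the internal states of the software model.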

The S2H cycle will now be discussed. The S2H cycle is used to deliver test bench data
from the RCC computing system to the RCC hardware array, and then move that data
sequentially from one chip to the next for each board. The CPU_IN signal goes to logic "1"
while the EXT_IN signal goes to logic "0," indicating that the data transfer is between the RCC
computing system and the RCC hardware array. The external interface is not involved. The
CPU_IN signal also enables the tri-state buffer 2202 to allow data to pass from the local bus 2222
to the internal I/O controller 2203.

In the beginning of the CPU_IN=1 time period, S2H_PTR0 goes to logic "1," which
causes the data on FD5 (via local bus 2222, local bus line 2229, bus line 2234, and FD bus
2239) to be latched to the internal node in the hardware model coupled to line 2251. In the
second part of the CPU_IN=1 time period, S2H_PTR1 goes to logic "1," which causes the data
on FD7 (via local bus 2222, local bus line 2230, bus line 2235, and FD bus 2240) to be latched
to the internal node in the hardware model coupled to line 2252. During the sequential data
evaluation, the data from the RCC computing system is delivered to chip m1 first, then chip0_1
(i.e., chip 0 on board 1), chip1_1 (i.e., chip 1 on board 1), until the last chip on the last board,
chip7_8 (i.e., chip 7 on board 8). If chip m2 is available, the data is also moved into this chip as
well.
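
The chip-by-chip delivery order can be enumerated with a small Python sketch. The function name s2h_chip_order is hypothetical, the array size of eight boards with eight chips each is inferred from the name chip7_8, and placing chip m2 at the end of the sequence is an assumption, since the text does not say where in the order it falls.

```python
def s2h_chip_order(boards=8, chips_per_board=8, has_m2=False):
    """List chips in the order S2H data is assumed to be delivered."""
    order = ["m1"]  # chip m1 is served first
    for board in range(1, boards + 1):
        for chip in range(chips_per_board):
            order.append(f"chip{chip}_{board}")
    if has_m2:
        order.append("m2")  # assumption: m2, when present, comes last
    return order


order = s2h_chip_order(has_m2=True)
print(order[:3], order[-2:])
```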

At the end of this data transfer, the DATA_XSFR signal returns to logic "0." Note that the
I/O data from the external interface is treated as global data and handled during global cycles.
This concludes the discussion of the data-in control logic and the data-in cycles.

Data-out



The data-out control logic embodiment of the present invention will now be discussed.
The data-out control logic in accordance with one embodiment of the present invention is
responsible for handling the data delivered from the RCC hardware array to the RCC computing
system and the external interface. During the course of processing data in response to stimuli
(external or otherwise), the hardware model generates certain output data that the target
application(s) or some I/O devices may need. These output data may be substantive data, address,
control information, or other relevant information that another application or device may need for
its own processing. These output data to the RCC computing system (which may have models of
other external I/O devices in software), the target system, or external I/O devices are provided on
various internal nodes. As discussed above with respect to the data-in logic, some of these
internal nodes correspond to output pin-outs of the user design. The user design has other
internal nodes that are normally not accessible via pin-outs but these non-pin-out internal nodes
are for other debugging purposes to provide flexibility for the designer who desires to read and
analyze stimuli responses at various internal nodes in the user design, regardless of whether they
are output pin-outs or not. For stimuli applied to the external interface or the RCC computing
system (which may have models of other I/O devices in software) from the elaborate hardware
model of the user design, the data-out logic and those internal nodes corresponding to output
pin-outs are implicated.

For example, if the user design is a CRTC 6845 video controller, some output pin-outs
may be as follows:

MA0-MA13 - memory address
D0-D7 - data bus
DE - display enable
CURSOR - cursor position
VS - vertical synchronization
HS - horizontal synchronization

Other output pin-outs are also available in this video controller. Based on the number of
output pin-outs that interface to the outside world, the number of nodes and hence, the number of
gating logic elements and pointers can be readily determined. Thus, the output pin-outs
MA0-MA13 on the video controller provide the memory addresses for the video RAM. The VS
output pin-out provides the signal for the vertical synchronization, and thus causes a vertical
retrace on the monitor. The output pin-outs D0-D7 are the eight terminals which form the
bi-directional data bus for accessing the internal 6845 registers by the CPU in the target system.
These output pin-outs correspond to certain internal nodes in the hardware model. Of course, the
number and nature of these internal nodes vary depending on the user design.

The data from these output pin-out internal nodes must be provided to the RCC computing
system because the RCC computing system contains a model of the entire user design in software
and any event that occurs in the hardware model must be communicated to the software model so
that corresponding changes may be made. In this way, the software model will have information
consistent with that in the hardware model. Additionally, the RCC computing system may have
device models of I/O devices that the user or designer decided to model in software rather than
connect an actual device to one of the ports on the external I/O expander. For example, the user
may have decided that it is easier and more effective to model the monitor or speaker in software
rather than plug an actual monitor or speaker in one of the external I/O expander ports.
Furthermore, the data from these internal nodes in the hardware model must be provided to the
target system and any other external I/O devices. In order for data in these output pin-out
internal nodes to be delivered to the RCC computing system as well as the target system and
other external I/O devices, the data-out control logic in accordance with one embodiment of the
present invention is provided in the coverification system.

The data-out control logic employs data-out cycles that involve the transport of data from
the RCC hardware array 2190 to the RCC computing system 2141 and the external interface
(external I/O expander 2139). In FIG. 69, the control logic for transporting data between the
external interface (external I/O expander 2139) and the coverification system 2140 is found in
each board 2145-2149. The primary portion of the control logic is found in the external I/O
controller 2152 but other portions are found in the various internal I/O controllers (e.g., 2156
and 2158) and the reconfigurable logic elements (e.g., FPGA chips 2159 and 2165). Again, for
instructional purposes, it is necessary only to show some portion of this control logic instead of
the same repetitive logic structure for all chips in all boards. The portion of the coverification
system 2140 within the dotted line 2150 of FIG. 69 contains one subset of the control logic. This
control logic will now be discussed in greater detail with respect to FIGS. 71 and 73. FIG. 71
illustrates that portion of the control logic that is used for data-out cycles. FIG. 73 illustrates the
timing diagram of the data-out cycles.

One particular subset of the data-out control logic is shown in FIG. 71 and includes the
external I/O controller 2300, tri-state buffer 2301, internal I/O controller 2302, a reconfigurable
logic element 2303, and various buses and control lines to allow data transport therebetween.
This subset illustrates the logic necessary for data-out operations, where the data from the
RCC hardware array are delivered to the RCC computing system and the external interface. The
data-out control logic of FIG. 71 and the data-out timing diagram of FIG. 73 will be discussed
together.

In contrast to the two cycle types of the data-in cycles, the data-out cycle includes only
one type of cycle. The data-out control logic requires that the data from the RCC hardware
model be sequentially delivered to: (1) the RCC computing system, and then (2) the RCC
computing system and the external interface (to the target system and the external I/O devices).
Specifically, the data-out cycle requires that data from the internal nodes of the hardware model
in the RCC hardware array be delivered to the RCC computing system first, and then to the RCC
computing system and the external interface second in each chip, one chip at a time in each board
and one board at a time.

Like the data-in control logic, pointers will be used to select (or gate) data from the
internal nodes to the RCC computing system and the external interface. In one embodiment
illustrated in FIGS. 71 and 73, a data-out pointer state machine 2319 generates five pointers
H2S_PTR[4:0] on bus 2359 for both the hardware-to-software data and hardware-to-external
interface data. The data-out pointer state machine 2319 is controlled by the DATA_XSFR and
F_RD signals on line 2358. The internal I/O controller 2302 generates the DATA_XSFR and
F_RD signals on line 2358. The DATA_XSFR signal is always at logic "1" whenever data transfer
between the RCC hardware array and either the RCC computing system or the external interface
is desired. The F_RD signal, in contrast to the F_WR signal, is at logic "1" whenever a read
from the RCC hardware array is desired. If both the DATA_XSFR and F_RD signals are at
logic "1," the data-out pointer state machine 2319 can generate the proper H2S pointer signals at
the proper programmed sequence. Other embodiments may employ more pointers (or fewer
pointers) as necessary for the user design.

These H2S pointer signals are provided to a gating logic. One set of inputs 2353-2357 to
the gating logic is directed to several AND gates 2314-2318. The other set of inputs 2348-2352
are coupled to the internal nodes of the hardware model. Thus, AND gate 2314 has input 2348
from an internal node and input 2353 from H2S_PTR0; AND gate 2315 has input 2349 from an
internal node and input 2354 from H2S_PTR1; AND gate 2316 has input 2350 from an internal
node and input 2355 from H2S_PTR2; AND gate 2317 has input 2351 from an internal node and
input 2356 from H2S_PTR3; and AND gate 2318 has input 2352 from an internal node and input
2357 from H2S_PTR4. Without the proper H2S_PTR pointer signal, the internal nodes cannot
be driven to either the RCC computing system or the external interface.

The respective outputs 2343-2347 of these AND gates 2314-2318 are coupled to OR gates
2310-2313. Thus, AND gate output 2343 is coupled to the input of OR gate 2310; AND gate
output 2344 is coupled to the input of OR gate 2311; AND gate output 2345 is coupled to the
input of OR gate 2311; AND gate output 2346 is coupled to the input of OR gate 2312; and AND
gate output 2347 is coupled to the input of OR gate 2313. Note that the output 2344 of AND
gate 2315 is not coupled to an unshared OR gate; rather, output 2344 is coupled to OR gate 2311,
which is also coupled to output 2345 of AND gate 2316. The other inputs 2360-2366 to OR
gates 2310-2313 can be coupled to the outputs of other AND gates (not shown), which are
themselves coupled to other internal nodes and H2S_PTR pointers. The use of these OR gates
and their particular inputs are based on the user design and the configured hardware model.
Thus, in other designs, more pointers may be used and output 2344 from AND gate 2315 may be
coupled to a different OR gate, not OR gate 2311.

The outputs 2339-2342 of OR gates 2310-2313 are coupled to FD bus lines FD0, FD3,
FD1, and FD4. In this particular example of the user design, only four output pin-out signals
will be delivered to the RCC computing system and the external interface. Thus, FD0 is coupled
to the output of OR gate 2310; FD3 is coupled to the output of OR gate 2311; FD1 is coupled to
the output of OR gate 2312; and FD4 is coupled to the output of OR gate 2313. These FD bus
lines are coupled to local bus lines 2330-2333 via internal lines 2334-2338 in the internal I/O
controller 2302. In this embodiment, local bus line 2330 is LD0, local bus line 2331 is LD3,
local bus line 2332 is LD1, and local bus line 2333 is LD4.
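
The AND/OR gating described above can be expressed as a small Python sketch. This is an illustrative software model only: the GATES table, the function name drive_fd, and the node values are assumptions, and it assumes a single H2S pointer is asserted at a time. The node input lines, pointer indices, and FD line assignments follow the FIG. 71 example.

```python
# (internal-node input line, H2S pointer index, FD line fed through the OR gate)
GATES = [
    (2348, 0, "FD0"),  # AND gate 2314 -> OR gate 2310
    (2349, 1, "FD3"),  # AND gate 2315 -> OR gate 2311 (shared)
    (2350, 2, "FD3"),  # AND gate 2316 -> OR gate 2311 (shared)
    (2351, 3, "FD1"),  # AND gate 2317 -> OR gate 2312
    (2352, 4, "FD4"),  # AND gate 2318 -> OR gate 2313
]


def drive_fd(node_values, active_pointer):
    """Return FD line values when exactly one H2S pointer is asserted."""
    fd = {"FD0": 0, "FD1": 0, "FD3": 0, "FD4": 0}
    for node_line, ptr, fd_line in GATES:
        enable = 1 if ptr == active_pointer else 0
        fd[fd_line] |= node_values[node_line] & enable  # AND, then OR
    return fd


# Hypothetical logic levels at the five internal nodes.
node_values = {2348: 1, 2349: 0, 2350: 1, 2351: 1, 2352: 0}
print(drive_fd(node_values, active_pointer=2))
```

Because only one pointer is high at a time, the two AND gates sharing OR gate 2311 take turns driving FD3 rather than contending for it.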

To enable the data on these local bus lines 2330-2333 to be delivered to the RCC
computing system, these local bus lines are coupled to the tri-state buffer 2301. In its normal
state, the tri-state buffer 2301 allows data to pass from the local bus lines 2330-2333 to the local
bus 2320. In contrast, during data-in, data is allowed to pass from the RCC computing system to
the RCC hardware array only when the CPU_IN signal is provided to the tri-state buffer 2301.

To enable the data on these local bus lines 2330-2333 to be delivered to the external
interface, lines 2321-2324 are provided. Line 2321 is coupled to line 2330 and some latch (not
shown) in the external I/O controller 2300; line 2322 is coupled to line 2331 and some latch (not
shown) in the external I/O controller 2300; line 2323 is coupled to line 2332 and latch 2305 in
the external I/O controller 2300; and line 2324 is coupled to line 2333 and latch 2306 in the
external I/O controller 2300.

Each output of these latches 2305 and 2306 is coupled to a buffer and then to the external
interface, which is then coupled to the appropriate output pin-outs of the target system or the
external I/O devices. Thus, the output of latch 2305 is coupled to buffer 2307 and line 2327.
Also, the output of latch 2306 is coupled to buffer 2308 and line 2328. Another output of
another latch (not shown) can be coupled to line 2329. In this example, lines 2327-2329
correspond to wire1, wire4, and wire3, respectively, of the target system or some external I/O
device. Ultimately, during a data transfer from the hardware model to the external interface, the
hardware model of the user design is configured so that the internal node coupled to line 2350
corresponds to wire3 on line 2329, the internal node coupled to line 2351 corresponds to wire1
on line 2327, and the internal node coupled to line 2352 corresponds to wire4 on line 2328.
Similarly, wire3 corresponds to LD3 on line 2331, wire1 corresponds to LD1 on line 2332, and
wire4 corresponds to LD4 on line 2333.

A look-up table 2309 is coupled to the enable inputs to these latches 2305 and 2306. The
look-up table 2309 is controlled by the F _RD signal on line 2367 which triggers the operation of 
the look-up table address counter 2304. At each counter increment, the pointer enables a 
particular row in the look-up table 2309. If an entry (or bit) in that particular row is at logic 
5 " 1," a LUT output line that is coupled to that particular entry in the look-up table 2309 will 
enable its corresponding latch and drive the data into the external interface and ultimately, to the 
desired destination in the target system or some external I/O device. For example, LUT output 
line 2325 is coupled to the enable input to latch 2305 and LUT output line 2326 is coupled to the 
enable input to latch 2306. 

In this example, rows 0-3 of the look-up table 2309 are programmed for enabling the 
latch(es) corresponding to the output pin-out wire(s) for the internal nodes in chip m1. Similarly, 
rows 4-6 are programmed for enabling the latch(es) corresponding to the output pin-out wire(s) 
for the internal nodes in chip0_1 (i.e., chip 0 in board 1). In row 4, bit 3 is at logic "1." In row 
5, bit 1 is at logic "1." In row 6, bit 4 is at logic "1." All other entries or bit positions are at 
logic "0." For any given bit position (or column) in the look-up table, only one entry is at logic 
"1" because a single output pin-out wire cannot drive multiple I/O devices. In other words, an 
output pin-out internal node in the hardware model can provide data to only a single wire coupled 
to the external interface. 
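
The row and bit programming described above can be sketched in a few lines of Python. This is only an illustrative model, not the hardware implementation: the names `LUT_ROWS`, `enabled_bits`, and `check_one_entry_per_column` are hypothetical, rows 0-3 (chip m1) are left empty because the text does not give their bit patterns, and reading bit n as the enable for wire n is an assumption that merely agrees with the wire3/wire1/wire4 example.

```python
# Hypothetical model of look-up table 2309. Only rows 4-6 (chip0_1) are
# spelled out in the text; rows 0-3 (chip m1) are intentionally omitted.
LUT_ROWS = {
    4: {3},  # row 4: bit 3 at logic "1"
    5: {1},  # row 5: bit 1 at logic "1"
    6: {4},  # row 6: bit 4 at logic "1"
}


def enabled_bits(row):
    """Return the bit positions at logic 1 in the row selected by the counter."""
    return sorted(LUT_ROWS.get(row, set()))


def check_one_entry_per_column(rows):
    """Check that no column (bit position) is at logic 1 in more than one row.

    Returns a dict mapping each used bit position to the row that sets it.
    """
    owner = {}
    for row, bits in sorted(rows.items()):
        for bit in bits:
            if bit in owner:
                raise ValueError(
                    f"bit {bit} is set in both row {owner[bit]} and row {row}"
                )
            owner[bit] = row
    return owner


if __name__ == "__main__":
    for row in (4, 5, 6):
        # Assumption: bit n enables the latch for wire n.
        print(row, [f"wire{b}" for b in enabled_bits(row)])
    print(check_one_entry_per_column(LUT_ROWS))
```

The column check mirrors the stated constraint that a single output pin-out wire cannot drive multiple I/O devices.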

As mentioned above, the data-out control logic requires that the data in each 
reconfigurable logic element in each chip in the RCC hardware model be sequentially delivered 
to: (1) the RCC computing system, and then (2) the RCC computing system and the external 
interface (to the target system and the external I/O devices) together. The RCC computing 
system needs these data because it has models of some I/O devices in software and, for those data 
that are not intended for one of these modeled I/O devices, the RCC computing system needs to 
monitor them so that its internal states are consistent with that of the hardware model in the RCC 
hardware array. In this example illustrated in FIGS. 71 and 73, only seven internal nodes will be 
driven for output to the RCC computing system and the external interface. Two of those internal 
nodes are in chip m1 and the other five internal nodes are in chip0_1 (i.e., chip 0 in board 1). Of 
course, other internal nodes in these and other chips may be required for this particular user 
design, but FIGS. 71 and 73 illustrate only these seven nodes. 

During data transfer, the DATA_XSFR signal is at logic "1." During this time, the local 
bus 2330-2333 will be used by the coverification system to transport data from each chip in each 
board in the RCC hardware array sequentially to both the RCC computing system and the 
external interface. The DATA_XSFR and F_RD signals control the operation of the data-out 
pointer state machine for generating the proper pointer signals H2S_PTR[4:0] to the appropriate 
gates for the output pin-out internal nodes. The F_RD signal also controls the look-up table 
address counter 2304 for delivery of the internal node data to the external interface. 

The internal nodes in chip m1 will be handled first. When F_RD rises to logic "1" at the 
beginning of the data transfer cycle, H2S_PTR0 in chip m1 goes to logic "1." This drives the 
data in those internal nodes in chip m1 that rely on H2S_PTR0 to the RCC computing system via 
tri-state buffer 2301 and local bus 2320. The look-up table address counter 2304 counts and 
points to row 0 of look-up table 2309 to latch in the appropriate data in chip m1 to the external 
interface. When the F_RD signal goes to logic "1" again, the data at the internal nodes that can 
be driven by H2S_PTR1 are delivered to the RCC computing system and the external interface. 
H2S_PTR1 goes to logic "1" and, in response to the second F_RD signal, the look-up table 
address counter 2304 counts and points to row 1 of look-up table 2309 to latch in the appropriate 
data in chip m1 to the external interface. 

The five internal nodes in reconfigurable logic element 2303 (i.e., chip0_1, or chip 0 in 
board 1) will now be handled. In this example, data from the two internal nodes associated with 
H2S_PTR0 and H2S_PTR1 will be delivered to the RCC computing system only. Data from the 
three internal nodes associated with H2S_PTR2, H2S_PTR3, and H2S_PTR4 will be delivered 
to the RCC computing system and the external interface. 

When F_RD rises to logic "1," H2S_PTR0 in chip 2303 goes to logic "1." This drives 
the data in those internal nodes in chip 2303 that rely on H2S_PTR0 to the RCC computing 
system via tri-state buffer 2301 and local bus 2320. In this example, the internal node coupled to 
line 2348 relies on H2S_PTR0 on line 2353. When the F_RD signal goes to logic "1" again, 
the data at the internal nodes that can be driven by H2S_PTR1 are delivered to the RCC 
computing system. Here, the internal node coupled to line 2349 is affected. This data is driven 
to LD3 on line 2331 and 2322. 

When the F_RD signal goes to logic "1" again, H2S_PTR2 goes to logic "1" and the 
data at the internal node that is coupled to line 2350 is provided on LD3. This data is provided to 
both the RCC computing system and the external interface. The tri-state buffer 2301 allows the 
data to pass to the local bus 2320 and then into the RCC computing system. As for the external 
interface, this data is driven to LD3 on line 2331 and 2322 by the enabling H2S_PTR2 signal. In 
response to the F_RD signal, the look-up table address counter 2304 counts and points to row 4 
of look-up table 2309 to latch in the appropriate data from this internal node coupled to line 2350 
to line 2329 (wire3) at the external interface. 

When the F_RD signal goes to logic "1" again, H2S_PTR3 goes to logic "1" and the 
data at the internal node that is coupled to line 2351 is provided on LD1. This data is provided to 
both the RCC computing system and the external interface. The tri-state buffer 2301 allows the 
data to pass to the local bus 2320 and then into the RCC computing system. As for the external 
interface, this data is driven to LD1 on line 2332 and 2323 by the enabling H2S_PTR3 signal. In 
response to the F_RD signal, the look-up table address counter 2304 counts and points to row 5 
of look-up table 2309 to latch in the appropriate data from this internal node coupled to line 2351 
to line 2327 (wire1) at the external interface. 

When the F_RD signal goes to logic "1" again, H2S_PTR4 goes to logic "1" and the 
data at the internal node that is coupled to line 2352 is provided on LD4. This data is provided to 
both the RCC computing system and the external interface. The tri-state buffer 2301 allows the 
data to pass to the local bus 2320 and then into the RCC computing system. As for the external 
interface, this data is driven to LD4 on line 2333 and 2324 by the enabling H2S_PTR4 signal. In 
response to the F_RD signal, the look-up table address counter 2304 counts and points to row 6 
of look-up table 2309 to latch in the appropriate data from this internal node coupled to line 2352 
to line 2328 (wire4) at the external interface. 
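
The five F_RD pulses for chip 2303 described above can be summarized as a small table-driven sketch. It is a bookkeeping aid only, assuming one pointer is asserted per F_RD pulse as the text describes; the `Step` type and the function names are hypothetical, and no look-up table row is recorded for the first two steps because the text does not state one.

```python
from typing import List, NamedTuple, Optional, Tuple


class Step(NamedTuple):
    pointer: str            # H2S_PTR signal asserted on this F_RD pulse
    node_line: int          # line coupled to the internal node being driven
    to_external: bool       # True if the data also goes to the external interface
    lut_row: Optional[int]  # row of look-up table 2309 selected, where stated
    wire: Optional[str]     # external wire that latches the data, where stated


# Sequence for chip 2303 (chip0_1), transcribed from the description above.
CHIP0_1_STEPS = [
    Step("H2S_PTR0", 2348, False, None, None),
    Step("H2S_PTR1", 2349, False, None, None),
    Step("H2S_PTR2", 2350, True, 4, "wire3"),
    Step("H2S_PTR3", 2351, True, 5, "wire1"),
    Step("H2S_PTR4", 2352, True, 6, "wire4"),
]


def destinations(step: Step) -> List[str]:
    """Every step reaches the RCC computing system; some also reach the external interface."""
    dests = ["RCC computing system"]
    if step.to_external:
        dests.append("external interface")
    return dests


def run(steps: List[Step]) -> List[Tuple[str, List[str], Optional[str]]]:
    """Return one (pointer, destinations, wire) entry per F_RD pulse."""
    return [(s.pointer, destinations(s), s.wire) for s in steps]


if __name__ == "__main__":
    for entry in run(CHIP0_1_STEPS):
        print(entry)
```

Listing the computing-system-only steps first reflects the two-step ordering the data-out control logic follows within each chip.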

This process of driving data at the internal nodes of chip m1 to the RCC computing 
system first and then to both the RCC computing system and the external interface continues for 
the other chips sequentially. First, the internal nodes of chip m1 were driven. Second, the 
internal nodes of chip0_1 (chip 2303) were driven. Next, the internal nodes, if any, of chip1_1 
will be driven. This continues until the last nodes in the last chips in the last board are driven. 
Thus, the internal nodes, if any, of chip7_8 will be driven. Finally, the internal nodes, if any, of 
chip m2 will be driven. 

Although FIG. 71 shows the data-out control logic for driving internal nodes in chip 2303 
only, other chips may also have internal nodes that may need to be driven to the RCC computing 
system and the external interface. Regardless of the number of internal nodes, the data-out logic 
will drive the data from the internal nodes in one chip to the RCC computing system and then at 
another cycle, drive a different set of internal nodes in the same chip to the RCC computing 
system and the external interface together. The data-out control logic then moves on to the next 
chip and performs the same two-step operation of driving data designated for the RCC computing 
system first and then driving data designated for the external interface to both the RCC computing 
system and the external interface. Even if the data is intended for the external interface, the RCC 
computing system must have knowledge of that data because the RCC computing system has a 
model of the entire user design in software that must have internal state information that is 
consistent with that of the hardware model in the RCC hardware array. 

Board layout 

The board layout of the coverification system in accordance with one embodiment of the 
present invention will now be discussed with respect to FIG. 74. The boards are installed in the 
RCC hardware array. The board layout is similar to that illustrated in FIGS. 8 and 36-44 and 
described in the accompanying text. 

The RCC hardware array includes six boards, in one embodiment. Board m1 is coupled 
to board1 and board m2 is coupled to boardS. The coupling and arrangement of board1, board2, 
boards, and boardS have been described above with respect to FIGS. 8 and 36-44. 

Board m1 contains chip m1. The interconnect structure of board m1 with respect to the 
other boards is such that chip m1 is coupled to the South interconnects to chip 0, chip 2, chip 4, 
and chip 6 of board1. Analogously, board m2 contains chip m2. The interconnect structure of 
board m2 with respect to the other boards is such that chip m2 is coupled to the South 
interconnects to chip 0, chip 2, chip 4, and chip 6 of boards. 

X. EXAMPLES 

To illustrate the operation of one embodiment of the present invention, a hypothetical user 
circuit design will be used. In structured register transfer level (RTL) HDL code, the exemplary 
user circuit design is as follows: 

module register (clock, reset, d, q);
    input clock, d, reset;
    output q;
    reg q;

    always @(posedge clock or negedge reset)
        if (~reset)
            q = 0;
        else
            q = d;

endmodule

module example;
    wire d1, d2, d3;
    wire q1, q2, q3;

    reg sigin;
    wire sigout;
    reg clk, reset;

    register reg1 (clk, reset, d1, q1);
    register reg2 (clk, reset, d2, q2);
    register reg3 (clk, reset, d3, q3);

    assign d1 = sigin ^ q3;
    assign d2 = q1 ^ q3;
    assign d3 = q2 ^ q3;
    assign sigout = q3;

    // a clock generator
    always
    begin
        clk = 0;
        #5;
        clk = 1;
        #5;
    end

    // a signal generator
    always
    begin
        #10;
        sigin = $random;
    end

    // initialization
    initial
    begin
        reset = 0;
        sigin = 0;
        #1;
        reset = 1;
        #5;
        $monitor($time, " %b, %b,", sigin, sigout);
        #1000 $finish;
    end

endmodule

This code is reproduced in FIG. 26. The particular functional details of this circuit design 
are not necessary to understand the present invention. The reader should understand, however, 
that the user generates this HDL code to design a circuit for simulation. The circuit represented 
by this code performs some function as designed by the user to respond to input signals and 
generate an output. 

FIG. 27 shows the circuit diagram of the HDL code discussed with respect to FIG. 26. In 
most cases, the user may actually generate a circuit diagram of this nature before representing it 
in HDL form. Some schematic capture tools allow pictorial circuit diagrams to be entered and, 
after processing, these tools generate the usable code. 

As shown in FIG. 28, the SEmulation system performs component type analysis. The 
HDL code, originally presented in FIG. 26 as representing a user's particular circuit design, has 
now been analyzed. The first few lines of the code, beginning with "module register (clock, 
reset, d, q);" and ending with "endmodule" and further identified by reference number 900, are a 
register definition section. 

The next few lines of code, reference number 907, represent some wire interconnection 
information. Wire variables in HDL, as known to those ordinarily skilled in the art, are used to 
represent physical connections between structural entities such as gates. Because HDL is 
primarily used to model digital circuits, wire variables are necessary variables. Usually, "q" 
(e.g., q1, q2, q3) represents output wire lines and "d" (e.g., d1, d2, d3) represents input wire 
lines. 

Reference number 908 shows "sigin," which is a test-bench output. Reference number 909 
shows "sigout," which is a test-bench input. 

Reference number 901 shows register components S1, S2, and S3. Reference number 902 
shows combinational components S4, S5, S6, and S7. Note that combinational components S4-S7 
have output variables d1, d2, and d3, which are inputs to the register components S1-S3. 
Reference number 903 shows clock component S8. 

The next series of code line numbers show test-bench components. Reference number 904 
shows test-bench component (driver) S9. Reference number 905 shows test-bench components 
(initialization) S10 and S11. Reference number 906 shows test-bench component (monitor) S12. 

The component type analysis is summarized in the following table: 





Component    Type
S1           Register
S2           Register
S3           Register
S4           Combinational
S5           Combinational
S6           Combinational
S7           Combinational
S8           Clock
S9           Test-bench (driver)
S10          Test-bench (initialization)
S11          Test-bench (initialization)
S12          Test-bench (monitor)



Based on the component type analysis, the system generates a software model for the 
entire circuit and a hardware model for the register and combinational components. S1-S3 are 
register components and S4-S7 are combinational components. These components will be 
modeled in hardware to allow the user of the SEmulation system to either simulate the entire 
circuit in software, or simulate in software and selectively accelerate in hardware. In either case, 
the user has control of the simulation and hardware acceleration modes. Additionally, the user 
can emulate the circuit with a target system while still retaining software control to start, stop, 
inspect values, and assert input values cycle by cycle. 
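
As a rough illustration of how the component type analysis determines which components receive a hardware model, the following Python sketch transcribes the component types given for S1-S12 and applies the stated rule (every component is modeled in software; register and combinational components are also modeled in hardware). The names `COMPONENT_TYPES`, `HARDWARE_TYPES`, `software_model`, and `hardware_model` are hypothetical and are not taken from the system itself.

```python
# Component types for S1-S12, as given in the component type analysis.
COMPONENT_TYPES = {
    "S1": "register",
    "S2": "register",
    "S3": "register",
    "S4": "combinational",
    "S5": "combinational",
    "S6": "combinational",
    "S7": "combinational",
    "S8": "clock",
    "S9": "test-bench (driver)",
    "S10": "test-bench (initialization)",
    "S11": "test-bench (initialization)",
    "S12": "test-bench (monitor)",
}

# Rule stated in the text: only these types are modeled in hardware.
HARDWARE_TYPES = {"register", "combinational"}


def _by_number(name):
    """Sort key so that S10 follows S9 rather than S1."""
    return int(name[1:])


def software_model(components):
    """The software model covers the entire circuit."""
    return sorted(components, key=_by_number)


def hardware_model(components):
    """The hardware model covers register and combinational components only."""
    selected = [c for c, t in components.items() if t in HARDWARE_TYPES]
    return sorted(selected, key=_by_number)


if __name__ == "__main__":
    print("software:", software_model(COMPONENT_TYPES))
    print("hardware:", hardware_model(COMPONENT_TYPES))
```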

FIG. 29 shows a signal network analysis of the same structured RTL level HDL code. As 
illustrated, S8, S9, S10, and S11 are modeled or provided in software. S9 is essentially the 
test-bench process that generates the sigin signals and S12 is essentially the test-bench monitor 
process that receives the sigout signals. In this example, S9 generates a random sigin to simulate 
the circuit. However, registers S1 to S3 and combinational components S4 to S7 are modeled 
in hardware and software. 

For the software/hardware boundary, the system allocates memory space for the various 
residence signals (i.e., q1, q2, q3, CLK, sigin, sigout) that will be used to interface the software 
model to the hardware model. The memory space allocation is as follows in the table below: 



Signal    Memory Address Space
q1        REG
q2        REG
q3        REG
clk       CLK
sigin     S2H
sigout    H2S



FIG. 30 shows the software/hardware partition result for this example circuit design. 
FIG. 30 is a more realizable illustration of the software/hardware partition. The software side 
910 is coupled to the hardware side 912 through the software/hardware boundary 911 and the 
PCI bus 913. 

The software side 910 contains and is controlled by the software kernel. In general, the 
kernel is the main control loop that controls the operation of the overall SEmulation system. So 
long as any test-bench processes are active, the kernel evaluates active test-bench components, 
evaluates clock components, detects clock edges to update registers and memories as well as 
propagate combinational logic data, and advances the simulation time. Even though the kernel 
resides in the software side, some of its operations or statements can be executed in hardware 
because a hardware model exists for those statements and operations. Thus, the software controls 
both the software and hardware models. 
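
The kernel loop described above can be sketched for the example circuit as a software-only Python model. This is an illustration under stated assumptions rather than the kernel's actual implementation: the function name `run_kernel` is hypothetical, time advances in unit steps, the `$random` stimulus is replaced by a caller-supplied sequence so the run is repeatable, and all three registers are updated together from the values present just before the clock edge.

```python
def run_kernel(sigin_values, end_time=40):
    """Software-only sketch of the kernel loop for the example circuit.

    Returns a list of (time, sigin, sigout) samples, one per rising clock edge.
    `sigin_values` stands in for $random (an assumption made for repeatability).
    """
    q1 = q2 = q3 = 0
    sigin = 0
    reset = 0
    prev_clk = 0
    stimulus = iter(sigin_values)
    trace = []

    for t in range(end_time):
        # 1. Evaluate active test-bench components (initialization and driver).
        if t == 1:
            reset = 1                       # reset released after one time unit
        if t > 0 and t % 10 == 0:
            sigin = next(stimulus, sigin)   # new stimulus every 10 time units

        # 2. Evaluate the clock component: low for 5 units, high for 5 units.
        clk = 1 if (t % 10) >= 5 else 0

        # 3. Propagate combinational logic from the current register outputs.
        d1, d2, d3 = sigin ^ q3, q1 ^ q3, q2 ^ q3

        # 4. Detect a clock edge and update the registers together.
        if reset == 0:
            q1 = q2 = q3 = 0                # active-low reset
        elif prev_clk == 0 and clk == 1:
            q1, q2, q3 = d1, d2, d3
            trace.append((t, sigin, q3))    # sigout is q3
        prev_clk = clk
        # 5. Simulation time advances with the loop variable.

    return trace


if __name__ == "__main__":
    for sample in run_kernel([1, 0, 1]):
        print(sample)
```

Updating the three registers in a single step from pre-edge values is a modeling choice in this sketch; it mirrors the idea of gating data into all registers on one enable rather than evaluating them one after another.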

The software side 910 includes the entire model of the user's circuit, including S1-S12. 
The software/hardware boundary portion in the software side includes I/O buffers or address 
spaces S2H, CLK, H2S, and REG. Note that driver test-bench process S9 is coupled to the S2H 
address space, monitor test-bench process S12 is coupled to the H2S address space, and the clock 
generator S8 is coupled to the CLK address space. The register S1-S3 output signals q1-q3 will 
be assigned to REG space. 
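
The boundary allocation for this example can be written down as a simple mapping. The Python sketch below only restates the signal-to-address-space assignments and component couplings given in the text; the names `ADDRESS_SPACE`, `COUPLED_COMPONENT`, and `signals_by_space` are hypothetical, and reading S2H and H2S as software-to-hardware and hardware-to-software is an assumption consistent with the driver and monitor couplings.

```python
# Signal -> address space, as allocated for the software/hardware boundary.
ADDRESS_SPACE = {
    "q1": "REG",
    "q2": "REG",
    "q3": "REG",
    "clk": "CLK",
    "sigin": "S2H",   # assumed: software-to-hardware (driven by test-bench S9)
    "sigout": "H2S",  # assumed: hardware-to-software (read by monitor S12)
}

# Address space -> software-side components coupled to it, per the text.
COUPLED_COMPONENT = {
    "S2H": ["S9"],
    "H2S": ["S12"],
    "CLK": ["S8"],
    "REG": ["S1", "S2", "S3"],
}


def signals_by_space(mapping):
    """Group signals by the address space they are assigned to."""
    groups = {}
    for signal, space in mapping.items():
        groups.setdefault(space, []).append(signal)
    return groups


if __name__ == "__main__":
    for space, signals in signals_by_space(ADDRESS_SPACE).items():
        print(space, signals, "coupled to", COUPLED_COMPONENT[space])
```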

The hardware model 912 has a model of the combinational components S4-S7, which 
resides in the pure hardware side. On the software/hardware boundary portion of the hardware 
model 912, sigout, sigin, register outputs q1-q3, and the software clock 916 are implemented. 

In addition to the model of the user's custom circuit design, the system generates software 
clocks and address pointers. The software clock provides signals to the enable inputs to registers 
S1-S3. As discussed above, software clocks in accordance with the present invention eliminate 
race conditions and hold-time violation issues. When a clock edge is detected in software by the 
primary clock, the detection logic triggers a corresponding detection logic in hardware. In time, 
the clock edge register 916 generates an enable signal to the register enable inputs to gate in any 
data residing in the input to the register. 

Address pointer 914 is also shown for illustrative and conceptual purposes. Address 
pointers are actually implemented in each FPGA chip and allow the data to be selectively and 
sequentially transferred to its destination. 

The combinational components S4-S7 are also coupled to register components S1-S3, the 
sigin, and the sigout. These signals travel on the I/O bus 915 to and from the PCI bus 913. 

Prior to the mapping, placement, and routing steps, a complete hardware model is shown 
in FIG. 31, excluding the address pointers. The system has not mapped the model to specific 
chips yet. Registers S1-S3 are provided coupled to the I/O bus and the combinational 
components S4-S6. Combinational component S7 (not shown in FIG. 31) is just the output q3 of 
the register S3. The sigin, sigout, and software clock 920 are also modeled. 

Once the hardware model has been determined, the system can then map, place, and route 
the model into one or more chips. This particular example can actually be implemented on a 
single Altera FLEX 10K chip, but for pedagogic purposes, this example will assume that two 
chips will be required to implement this hardware model. FIG. 32 shows one particular hardware 
model-to-chip partition result for this example. 

In FIG. 32, the complete model (except for the I/O and clock edge register) is shown with 
the chip boundary represented by the dotted line. This result is produced by the SEmulation 
system's compiler before the final configuration file is generated. Thus, the hardware model 
requires at least three wires between these two chips for wire lines 921, 922, and 923. To 
minimize the number of pins/wires needed between these two chips (chip 1 and chip 2), either 
another model-to-chip partition should be generated or a multiplexing scheme should be used. 

Analyzing this particular partition result shown in FIG. 32, the number of wires between 
these two chips can be reduced to two by moving the sigin wire line 923 from chip 2 to chip 1. 
Indeed, FIG. 33 illustrates this partition. Although the particular partition in FIG. 33 appears to 
be a better partition than the partition in FIG. 32 based solely on the number of wires, this 
example will assume that the SEmulator system has selected the partition of FIG. 32 after the 
mapping, placement, and routing operations have been performed. The partition result of FIG. 
32 will be used as the basis for generating the configuration file. 
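
Counting the wires that cross a chip boundary can be illustrated with a short Python sketch. The netlist below is derived from the assign statements of the example HDL code, but the placement is purely hypothetical: it is chosen only so that three nets cross the boundary, as the text states for FIG. 32, and it is not claimed to be the placement actually drawn in that figure. The names `NETS`, `PLACEMENT_A`, `cut_nets`, and the endpoint labels (`xor_d1`, `sigin_io`, and so on) are likewise assumptions made for illustration.

```python
# Each net: name -> (driver, loads). Derived from the example HDL:
#   d1 = sigin ^ q3, d2 = q1 ^ q3, d3 = q2 ^ q3, sigout = q3.
NETS = {
    "sigin": ("sigin_io", ["xor_d1"]),
    "d1": ("xor_d1", ["reg1"]),
    "d2": ("xor_d2", ["reg2"]),
    "d3": ("xor_d3", ["reg3"]),
    "q1": ("reg1", ["xor_d2"]),
    "q2": ("reg2", ["xor_d3"]),
    "q3": ("reg3", ["xor_d1", "xor_d2", "xor_d3", "sigout_io"]),
}

# Hypothetical two-chip placement (NOT taken from FIG. 32).
PLACEMENT_A = {
    "xor_d1": 1, "reg1": 1, "xor_d2": 1, "reg2": 1,
    "xor_d3": 2, "reg3": 2, "sigin_io": 2, "sigout_io": 2,
}


def cut_nets(nets, placement):
    """Return the sorted names of nets whose endpoints sit on more than one chip."""
    crossing = []
    for name, (driver, loads) in nets.items():
        chips = {placement[driver]} | {placement[load] for load in loads}
        if len(chips) > 1:
            crossing.append(name)
    return sorted(crossing)


if __name__ == "__main__":
    print(cut_nets(NETS, PLACEMENT_A))   # three nets cross the boundary
    # Moving the sigin connection to chip 1 removes one crossing net.
    placement_b = dict(PLACEMENT_A, sigin_io=1)
    print(cut_nets(NETS, placement_b))   # two nets cross the boundary
```

A compiler weighing alternative partitions could use a count of this kind as one input, alongside the mapping, placement, and routing results mentioned above.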

FIG. 34 shows the logic patching operation for the same hypothetical example, in which 
the final realization in two chips is shown. The system used the partition result of FIG. 32 to 
generate the configuration files. The address pointers are not shown, however, for simplicity 
purposes. Two FPGA chips 930 and 940 are shown. Chip 930 includes, among other elements, 
a partitioned portion of the user's circuit design, a TDM unit 931 (receiver side), the software 
clock 932, and I/O bus 933. Chip 940 includes, among other elements, a partitioned portion of 
the user's circuit design, a TDM unit 941 for the transmission side, the software clock 942, and 
I/O bus 943. The TDM units 931 and 941 were discussed with respect to FIGS. 9(A), 9(B), and 
9(C). 

These chips 930 and 940 have two interconnect wires 944 and 945 that couple the 
hardware model together. These two interconnect wires are part of the interconnections shown in 
FIG. 8. Referring to FIG. 8, one such interconnection is interconnection 611 located between 
chip F32 and F33. In one embodiment, the maximum number of wires/pins for each 
interconnection is 44. In FIG. 34, the modeled circuit needs only two wires/pins between chips 
930 and 940. 

These chips 930 and 940 are coupled to the bank bus 950. Because only two chips are 
implemented, both chips are in the same bank or each is residing in a different bank. Optimally, 
one chip is coupled to one bank bus and the other chip is coupled to another bank bus to ensure 
that the throughput at the FPGA interface is the same as the throughput at the PCI interface. 

The foregoing description of a preferred embodiment of the invention has been presented 
for purposes of illustration and description. It is not intended to be exhaustive or to limit the 
invention to the precise forms disclosed. Obviously, many modifications and variations will be 
apparent to practitioners skilled in this art. One skilled in the art will readily appreciate that 
other applications may be substituted for those set forth herein without departing from the spirit 
and scope of the present invention. Accordingly, the invention should only be limited by the 
claims included below. 